Introduction to protein alignments

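Comparative Analysis of Proteins. Experimental evidence from one or more proteins can be used to infer the function of related protein(s). [Diagram: Gene A encodes Protein A, whose function is known from experiment; Protein A is compared with Protein X (from Gene X) to infer the function of Protein X.]

Comparative Analysis of Proteins. The more experimental evidence you have, the more likely your inference is correct. [Diagram: Genes A, B, C, D encode Proteins A, B, C, D, whose functions are known from experiments; these proteins are compared with Protein X (from Gene X) to infer the function of Protein X.]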

What do we compare? Primary structure (1°): the sequence of amino acids, compared with sequence alignments. Each amino acid has unique chemical properties. Look at "The 20 Amino Acids Found in Proteins".

What do we compare? Secondary structure (2°): 3D structure of localized regions in a protein. Alpha-helix and beta-sheet.

What do we compare? Tertiary structure (3°): 3D structure of the entire protein, determined by the amino acid sequence.

What do we compare? [Figure: superimposition of PTP1B (magenta), RPTPα (gray), RPTPµ (red), LAR (blue), SHP1 (green) and SHP2 (yellow); labeled regions: N-terminal, central α3-helix.] Sources: http://ptp.cshl.edu and http://science.novonordisk.com/ptp; Andersen et al., Mol. Cell. Biol. 2001.

Some proteins are made of modular domains. [Figure: protein architectures assembled from Ig, FN III, MAM and P domains.]

Model organisms

Species                      Domain      Kingdom     Phylum
E. coli                      Bacteria
Yeast                        Eukaryota   Fungi
Dictyostelium                Eukaryota   Amoebozoa
Arabidopsis                  Eukaryota   Plantae
C. elegans                   Eukaryota   Animalia    Nematoda
D. melanogaster (fruit fly)  Eukaryota   Animalia    Arthropoda
D. rerio (zebrafish)         Eukaryota   Animalia    Chordata
M. musculus (mouse)          Eukaryota   Animalia    Chordata
H. sapiens                   Eukaryota   Animalia    Chordata

Evolution of a Sequence (review)
Red aa's: essential to protein function

atg gcg gtg cgc att gaa acc ggc tat gaa ctg atg
--- --- --- +-- --- --- -+- --- --- --+ --- ---
 M   A   V   R   I   E   T   G   Y   E   L   M

atg gcg gtg cgc att gaa Gcc ggc tat gaa ctg atg
--- --- --- +-- --- --- -+- --- --- --+ --- ---
 M   A   V   R   I   E   A   G   Y   E   L   M

atg gcg gtg cgc att gaa Gcc ggc ttt gaa ctg agg
--- --- --- +-- --- --- -+- --- --- --+ --- ---
 M   A   V   R   I   E   A   G   F   E   L   R

atg gcg gtg ctc att gaa Gcc ggc ttt gaa ctg agg
--- --- --- +-- --- --- -+- --- --- --+ --- ---
 M   A   V   L   I   E   A   G   F   E   L   R

atg gcg gtg cgc att gca Gcc ggc ttt gaa ctg agg
--- --- --- +-- --- --- -+- --- --- --+ --- ---
 M   A   V   R   I   A   A   G   F   E   L   R
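The translations on this slide can be checked with a few lines of Python. This is a minimal sketch: the codon table below is a partial excerpt covering only the codons used in the example, and the names `CODON_TABLE` and `translate` are illustrative, not part of the original slides.

```python
# Partial codon table: only the codons that appear in the slide's example.
CODON_TABLE = {
    "atg": "M", "gcg": "A", "gtg": "V", "cgc": "R", "ctc": "L",
    "att": "I", "gaa": "E", "gca": "A", "acc": "T", "gcc": "A",
    "ggc": "G", "tat": "Y", "ttt": "F", "ctg": "L", "agg": "R",
}


def translate(dna: str) -> str:
    """Translate space-separated codons into one-letter amino acid codes."""
    return "".join(CODON_TABLE[codon.lower()] for codon in dna.split())


# The five DNA sequences from the slide, in order.
steps = [
    "atg gcg gtg cgc att gaa acc ggc tat gaa ctg atg",
    "atg gcg gtg cgc att gaa Gcc ggc tat gaa ctg atg",
    "atg gcg gtg cgc att gaa Gcc ggc ttt gaa ctg agg",
    "atg gcg gtg ctc att gaa Gcc ggc ttt gaa ctg agg",
    "atg gcg gtg cgc att gca Gcc ggc ttt gaa ctg agg",
]
proteins = [translate(step) for step in steps]
for protein in proteins:
    print(protein)
```

Each single-nucleotide change alters at most one amino acid, which is why the protein sequences drift apart one position at a time.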

Evolution of a Sequence MAVLIESGFELR MAVRIEAGYELM

Homology. Homology: features in different individuals that are descended from the same feature in a common ancestor (from Baxevanis, A. D. and Ouellette, B. F. F., Bioinformatics, 3rd Ed., 2005). For example, the human arm and the bird wing are homologues. Analogous features: similar features in different individuals that arose independently through natural selection rather than from a common ancestor. For example: body shape of dolphins and tuna.

Homology Homologous sequences: Sequences that are similar because they are descended from a common ancestor. Homologous structures: Structures that are similar because they are descended from a common ancestor. May not have high degree of sequence similarity.

Homologs orthologs: Result of speciation or split of one species into two. Similar function in different species. paralogs: produced by gene duplication in an organism. One species can have multiple paralogs. Related, but not identical, function - e.g. different proteases. Many proteases in one species all break peptide bonds, but each has different target amino acids or proteins.

Origins of Homologs. Gene duplication in one species: A gives rise to A and B. Speciation (creation of new species): A gives rise to A1 and A2; B gives rise to B1 and B2. A, A1, and A2 are orthologs. A and B are paralogs. A1 and B1 are paralogs. A and B2 are paralogs.
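One common way to express these relationships in code is to label each branching point of the gene tree as a duplication or a speciation and then look at the most recent common ancestor of two genes. The sketch below assumes a hand-built tree matching the slide (the node name `root` for the gene before duplication, and the dictionaries `PARENT` and `EVENT`, are hypothetical names chosen for illustration).

```python
# Gene tree from the slide: the ancestral gene ("root") duplicates into A and B,
# then a speciation splits A into A1/A2 and B into B1/B2.
PARENT = {"A": "root", "B": "root", "A1": "A", "A2": "A", "B1": "B", "B2": "B"}
EVENT = {"root": "duplication", "A": "speciation", "B": "speciation"}


def ancestors(node):
    """Return the path from a node up to the root, starting with the node itself."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def relationship(gene1, gene2):
    """Classify two different genes by the event at their last common ancestor."""
    seen = set(ancestors(gene1))
    for node in ancestors(gene2):
        if node in seen:
            return "orthologs" if EVENT[node] == "speciation" else "paralogs"
    raise ValueError("genes are not in the same tree")


print(relationship("A1", "A2"))  # genes separated by the speciation
print(relationship("A1", "B1"))  # genes separated by the duplication
```

Under this rule A1 and B2 also come out as paralogs, because the duplication happened before the species split.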

Sequence alignments: Overview. Which positions in two sequences are equivalent? Pairwise alignments: two sequences aligned. Multiple sequence alignment (MSA). Local and global alignments.

What information can we get from an alignment? Are genes or proteins similar? If so, infer similar function. Look for the presence of functional residues (active sites in enzymes): is your protein functional?

Local and Global Alignments. Local: finds the most similar regions of two sequences. Global: compares two sequences along their total length. Local alignments help align modular proteins or genes. Choose the method that gives the best alignment score.
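The difference can be seen by scoring the same pair of sequences both ways with dynamic programming. This is a minimal sketch, not the method used by any particular tool: the function name `best_score`, the scoring values (match 1, mismatch -1, gap -1) and the two test sequences, which share the segment VRETERI inside unrelated flanks, are all assumptions made for illustration.

```python
def best_score(a, b, local, match=1, mismatch=-1, gap=-1):
    """Best alignment score: Smith-Waterman style if local, Needleman-Wunsch style if not."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for gaps at the start of either sequence.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


# Two hypothetical sequences that share one module (VRETERI) and nothing else.
seq_a = "MKTWVRETERIPLLG"
seq_b = "QQVRETERIHH"
print("local :", best_score(seq_a, seq_b, local=True))
print("global:", best_score(seq_a, seq_b, local=False))
```

The local score reflects only the shared module, while the global score is dragged down by the unrelated flanking residues.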

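Local and Global Alignments. [Figure: sequences A and B; the local alignment gives high scores, the global alignment gives a low score.]

How is Alignment Made? [Figure: VRETERI aligned with VRATERI; how should VRETERI be aligned with VEITGEIST?]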

How is Alignment Made? Example: VRETERI and VEITGEIST. 1. Two sequences are aligned. 2. A score is given to the alignment. 3. All possible alignments are made and scores assigned. 4. The alignment with the best possible score is selected.
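In practice, programs do not list every possible alignment one by one; dynamic programming reaches the same best score by filling in a table of partial scores. The sketch below is a Needleman-Wunsch style global alignment, swapped in here as an illustration of steps 1 to 4; the function name and the scoring values (match 1, mismatch -1, gap -1) are assumptions, and when several alignments tie for the best score only one of them is returned.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best score, aligned a, aligned b) for a global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # Fill the table: each cell holds the best score for the two prefixes.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row1, row2 = needleman_wunsch("VRETERI", "VEITGEIST")
print(best)
print(row1)
print(row2)
```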

Scoring an Alignment. Simple scoring system: Match = 1, Mismatch = 0 or -1, Gap = -1. Align and score these two sequences: VRETERI and VEITGEIST.

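Scoring an Alignment. Simple scoring system: Match = 1, Mismatch = 0 or -1, Gap = -1. In the examples below, mismatches score 0 and the overhang at the end of the longer sequence is not penalized.

VRETERI--
VEITGEIST
1 0 0 1 0 0 1 0 0 = 3

Another alignment: more identities, but many more gaps, so the score is low.

VRE-T-ERI--
V-EITGE-IST
1 -1 1 -1 1 -1 1 -1 1 0 0 = 1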
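The simple scoring system is easy to apply in code. The sketch below assumes mismatch = 0 and treats overhanging ends as free (scored 0), which is the reading that reproduces the totals on this slide; the function name `score_alignment` is illustrative.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1):
    """Score two aligned rows column by column; overhanging ends are not penalized."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    columns = list(zip(top, bottom))
    # Drop end columns where one sequence has simply run out (overhang).
    while columns and "-" in columns[-1]:
        columns.pop()
    while columns and "-" in columns[0]:
        columns.pop(0)
    total = 0
    for x, y in columns:
        if x == "-" or y == "-":
            total += gap        # internal gap
        elif x == y:
            total += match      # identical residues
        else:
            total += mismatch   # different residues
    return total


print(score_alignment("VRETERI--", "VEITGEIST"))
print(score_alignment("VRE-T-ERI--", "V-EITGE-IST"))
```

Passing `mismatch=-1` instead shows how strongly the choice of mismatch penalty changes which alignment looks best.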

Advanced Scoring Systems. Scoring matrices: PAM and BLOSUM. Some substitutions in a sequence can conserve the function of the sequence position. Assign a different score to each substitution, not just 0 or 1. Assign separate scores for opening a gap and for extending a gap. Example (BLOSUM62): E aligned with E = 5; E aligned with D = 2; E aligned with I = -3.

Advanced Scoring Systems. Scoring matrices: PAM and BLOSUM. Factors taken into account with scoring matrices: conservation of residue function; which residues can substitute for another (charge, size, hydrophobicity: see the amino acid properties chart); frequency of the residue in all proteins; evolutionary patterns.

Choosing a scoring matrix

Matrix     Best use                                            Similarity
BLOSUM30   Long alignments of divergent sequences              <30%
BLOSUM62   Most effective at finding potential similarities    30-40%
BLOSUM80   Detecting members of a conserved protein family     50-60%
BLOSUM90   Short alignments of highly similar sequences        70-90%
PAM250     Long alignments of divergent sequences              <30%
PAM160     Detecting members of a conserved protein family     50-60%
PAM40      Short alignments of highly similar sequences        70-90%

Taken from Bioinformatics, A Practical Guide..., 3rd Ed., Baxevanis, A. D. and Ouellette, B. F. F., 2005, pg. 303

FASTA format (file ending .fa or .fasta). Why use FASTA? Simple, portable, widely used, easy to edit. Each record starts with ">", followed on the same line by the sequence information. The sequence starts on the following line (a carriage return is needed after the header). Be careful with the format: make sure you have it all.
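A short parser makes the format rules concrete. This is a minimal sketch under the rules stated above (header lines start with ">", sequence lines follow and may wrap); the function name `parse_fasta` and the record names in the example are made up for illustration.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[header].append(line)  # sequences may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


example = """>query1 hypothetical example
KRPFIETAERLRDQ
HKKDYPEYKYQPRRR
>query2
VRETERI
"""
for name, sequence in parse_fasta(example).items():
    print(name, len(sequence), sequence)
```

A missing ">" or a header that shares a line with sequence letters is the most common way a FASTA file goes wrong, which is why the parser refuses sequence data that has no header.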

BLAST: Proteins. Use BLASTP to find homologs: 1. Determine the function of a novel protein. 2. Find members of a gene or protein family. 3. Find a candidate protein structure.

How BLAST works. Given the sequence KRPFIETAERLRDQHKKDYPEYKYQPRRR, BLAST searches the database for the three-letter query words starting at each letter of the sequence. The word size can be changed in the parameters section. Examples of 3-letter words from KRPFIETAERLRDQHKKDYPEYKYQPRRR:

KRP   IET ...  → search database
RPF   ETA ...  → search database
PFI   TAE ...  → search database
FIE   AER ...  → search database
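Generating the query words is a sliding window over the sequence. A minimal sketch (the function name `query_words` is illustrative; `word_size` mirrors the adjustable word size mentioned above):

```python
def query_words(sequence, word_size=3):
    """Return every overlapping word of the given size, one starting at each position."""
    return [sequence[i:i + word_size] for i in range(len(sequence) - word_size + 1)]


query = "KRPFIETAERLRDQHKKDYPEYKYQPRRR"
words = query_words(query)
print(len(words), "words")
print(words[:8])
```

A query of length L yields L - 3 + 1 three-letter words, so this 29-residue example produces 27 words.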

How BLAST works. The program searches for exact matches as well as three-letter words containing conservative substitutions. Scores for each three-letter word are determined by a scoring matrix (e.g., BLOSUM62). For example, when the query word is RDQ in KRPFIETAERLRDQHKKDYPEYKYQPRRR, the score for the exact match RDQ/RDQ is 16 (R/R = 5, D/D = 6, Q/Q = 5; look at your BLOSUM62 matrix). Note: word size can be set in BLAST.

How BLAST works. Some possible three-letter words that match RDQ, with their scores, are:

RDQ 16   QDQ 12   NDQ 11   RDN 11   EDQ 11   SDQ 10
KDQ 13   REQ 12   HDQ 11   RDD 11   MDQ 10   RDP 10
RDE 13   RDR 12   RNQ 11   RDH 11   ADQ 10   RDT 10

Similar tables are generated for each three-letter word in the query. Each of these possible words is used to extend the alignment, or neighborhood, around the first alignment.
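The table above can be regenerated by scoring every possible three-letter word against RDQ. The sketch below embeds only the R, D and Q rows of BLOSUM62, transcribed by hand for this example, so it can score candidates against the query word RDQ only; check the values against a full copy of the matrix before reusing them. The names `BLOSUM62_ROWS`, `word_score` and `neighborhood` are illustrative.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # column order for the rows below

# Excerpt of BLOSUM62: rows for R, D and Q only (enough for the query word RDQ).
BLOSUM62_ROWS = {
    "R": [-1, 5, 0, -2, -3, 1, 0, -2, 0, -3, -2, 2, -1, -3, -2, -1, -1, -3, -2, -3],
    "D": [-2, -2, 1, 6, -3, 0, 2, -1, -1, -3, -4, -1, -3, -3, -1, 0, -1, -4, -3, -3],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
}


def word_score(query_word, candidate):
    """Sum the substitution scores of the aligned letters of two words."""
    return sum(BLOSUM62_ROWS[q][AMINO_ACIDS.index(c)]
               for q, c in zip(query_word, candidate))


def neighborhood(query_word, threshold):
    """All words of the same length scoring at least `threshold` against the query word."""
    hits = {}
    for letters in product(AMINO_ACIDS, repeat=len(query_word)):
        candidate = "".join(letters)
        score = word_score(query_word, candidate)
        if score >= threshold:
            hits[candidate] = score
    return hits


hits = neighborhood("RDQ", threshold=11)
for word, score in sorted(hits.items(), key=lambda item: -item[1]):
    print(word, score)
```

Raising the threshold shrinks the neighborhood and speeds up the search at the cost of sensitivity; lowering it does the opposite.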

How BLAST works. A three-letter word that aligns with RDQ is found in a sequence in the database.

query     KRPFIETAERLRDQHKKDYPEYKYQPRRR
                     R+Q
database  KRPFVEGAERLREQHKKDHPEYKYQPRRR

Extension of the neighborhood:

query     KRPFIETAERLRDQHKKDYPEYKYQPRRR
                     R+Q
                       QHK
database  KRPFVEGAERLREQHKKDHPEYKYQPRRR

Neighborhood cutoff set to 11 (not adjustable in web version?)

How BLAST works. The alignment is then extended from this initial word pair (RDQ/REQ) in both directions and a score is calculated. The length of the extension is determined by the score of the resulting alignment.

query    KRPFIETAERLRDQHKKDYPEYKYQPRRR
         KRPF+E AERLR+QHKKD+PEYKYQPRRR
subject  KRPFVEGAERLREQHKKDHPEYKYQPRRR

The resulting alignment is called a high-scoring segment pair (HSP). HSPs are possible alignments. More than one HSP can be generated for each query-subject pair.

How BLAST works. The alignment is then extended from this initial word pair (RDQ/REQ) in both directions and a score is calculated. The length of the extension is determined by the score of the resulting alignment. Extension stops when the score can't be improved by including more sequence. The resulting alignment is called a high-scoring segment pair (HSP).

query    KRPFIETAERLRDQHKKDYPEYKYQPRRR
         KRPF+E AERLR+QHKKD+PEYKYQPRRR
subject  KRPFVEGAERLREQHKKDHPEYKYQPRRR
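The extension step can be sketched as an ungapped walk outward from the word pair that keeps the best score seen so far and gives up once the running score falls too far below it. This is a simplified illustration, not BLAST's actual implementation: it scores with +1 for identical residues and -1 otherwise instead of BLOSUM62, and the drop-off limit `x_drop=3` and the function name `extend_hit` are assumptions.

```python
def extend_hit(query, subject, q_start, s_start, word_size=3,
               match=1, mismatch=-1, x_drop=3):
    """Extend a word hit without gaps in both directions.

    Returns (score, start, end) where start/end are query coordinates of the
    extended segment (end is exclusive).
    """
    def pair(i, j):
        return match if query[i] == subject[j] else mismatch

    seed = sum(pair(q_start + k, s_start + k) for k in range(word_size))

    def extend(i, j, step):
        # Walk one way; remember the best gain and how many residues it covered.
        gain = best_gain = 0
        length = best_length = 0
        while 0 <= i < len(query) and 0 <= j < len(subject):
            gain += pair(i, j)
            length += 1
            if gain > best_gain:
                best_gain, best_length = gain, length
            elif best_gain - gain > x_drop:
                break  # score has dropped too far; stop extending
            i += step
            j += step
        return best_gain, best_length

    right_gain, right_length = extend(q_start + word_size, s_start + word_size, +1)
    left_gain, left_length = extend(q_start - 1, s_start - 1, -1)
    score = seed + left_gain + right_gain
    return score, q_start - left_length, q_start + word_size + right_length


query = "KRPFIETAERLRDQHKKDYPEYKYQPRRR"
subject = "KRPFVEGAERLREQHKKDHPEYKYQPRRR"
score, start, end = extend_hit(query, subject, query.find("RDQ"), subject.find("REQ"))
print(score, query[start:end])
```

With this pair the extension runs to both ends of the sequences, because the few mismatches never pull the running score far enough below its best value to trigger the stop.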

How BLAST works The score of the HSP is used to calculate the E-value. The HSP is reported as a hit if the E-value is below the specified cutoff.

BLAST results. What are you looking for? What did you get? Homologs: orthologs or paralogs. Conserved domains in a new protein architecture: paralog?

BLAST results. If you have a lot of sequence in your database, you might expect to find any sequence pattern if you look hard enough. The likelihood of this happening is greater if the database is large and your query is short. Think about searching the nr database with a five or ten amino acid segment. ELVIS in the genome!
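A rough back-of-the-envelope estimate shows why short queries turn up by chance. The sketch below assumes all 20 amino acids are equally frequent and independent at every position, which real proteins are not, and the database sizes are hypothetical round numbers; the function name is illustrative.

```python
def expected_chance_matches(word_length, database_residues, alphabet_size=20):
    """Expected number of exact occurrences of one specific word in random sequence."""
    positions = max(database_residues - word_length + 1, 0)
    return positions / alphabet_size ** word_length


# Hypothetical database sizes, in residues.
for size in (1_000_000, 100_000_000, 10_000_000_000):
    five = expected_chance_matches(5, size)    # e.g. "ELVIS"
    ten = expected_chance_matches(10, size)
    print(f"{size:>14,} residues: 5-mer {five:.3g}, 10-mer {ten:.3g}")
```

The expected count grows in proportion to database size and shrinks by a factor of 20 for every extra residue in the query.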

BLAST results. 1. E-value: the expected number of hits with a score at least this good that would occur by chance. Calculated from the length of the database, the length of the query and the score of the HSP. There are no hard and fast rules for E-values: lower E-values indicate more significant hits. An E-value close to 0 indicates a highly significant match, such as an identical or nearly identical sequence. As a rule of thumb, the E-value should be below 0.0001.

BLAST results. 2. Bit score: a normalized measure of the significance of the alignment; higher is better. 3. Percent identity. 4. Length of alignment: the longer the better. 5. Beyond BLAST results: is the hit a homolog?
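The relationship between score, search space and E-value can be written down using the standard Karlin-Altschul statistics: E = K·m·n·e^(-λS), or equivalently E = m·n·2^(-S') where S' is the bit score. The sketch below is illustrative only: the values of `lam` and `k` are placeholder numbers rather than parameters of any real scoring system, and real BLAST applies length corrections that are not modeled here.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score: S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_length, database_length):
    """Expected number of chance hits: E = m * n * 2^(-S')."""
    return query_length * database_length * 2.0 ** (-bits)


# Placeholder statistical parameters, chosen only to make the example run.
lam, k = 0.3, 0.1
query_length = 29
for database_length in (1_000_000, 1_000_000_000):
    bits = bit_score(100, lam, k)
    print(database_length, round(bits, 1), e_value(bits, query_length, database_length))
```

The same HSP score gives a thousand-fold larger E-value in a thousand-fold larger database, which is why E-values from searches against different databases are not directly comparable.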