Introduction to protein alignments

Comparative Analysis of Proteins Experimental evidence from one or more proteins can be used to infer the function of related protein(s). [Diagram labels: Gene A, Gene X; Protein A, compare, Protein X; experiment?; FUNCTION]

Comparative Analysis of Proteins The more experimental evidence you have, the more likely your inference is correct. [Diagram labels: Genes A, B, C, D, Gene X; Proteins A, B, C, D, compare, Protein X; experiments?; FUNCTION]

What do we compare? Primary structure (1°): the sequence of amino acids. Compared using sequence alignments. Each amino acid has unique chemical properties. Look at the chart "The 20 Amino Acids Found in Proteins".

What do we compare? Secondary structure (2°): 3D structure of localized regions in a protein, such as the alpha-helix and beta-sheet.

What do we compare? Tertiary structure (3°): 3D structure of the entire protein. Determined by the amino acid sequence.

What do we compare? [Figure labels: N-terminal, Central, α3-helix] Superimposition of PTP1B (magenta), RPTPα (gray), RPTPµ (red), LAR (blue), SHP1 (green) and SHP2 (yellow). http://ptp.cshl.edu & http://science.novonordisk.com/ptp Andersen et al., Mol. Cell. Biol. 2001

Some proteins are made of modular domains Ig FN III FN III FN III MAM P MAM Ig FN III

Model organisms
Species                      Domain     Kingdom    Phylum
E. coli                      Bacteria
Yeast                        Eukaryota  Fungi
Dictyostelium                Eukaryota  Amoebozoa
Arabidopsis                  Eukaryota  Plantae
C. elegans                   Eukaryota  Animalia   Nematoda
D. melanogaster (fruit fly)  Eukaryota  Animalia   Arthropoda
D. rerio (zebrafish)         Eukaryota  Animalia   Chordata
M. musculus (mouse)          Eukaryota  Animalia   Chordata
H. sapiens                   Eukaryota  Animalia   Chordata

Evolution of a Sequence (review)
Red aa's: essential to protein function
atg gcg gtg cgc att gaa acc ggc tat gaa ctg atg
--- --- --- +-- --- --- -+- --- --- --+ --- ---
 M   A   V   R   I   E   T   G   Y   E   L   M
atg gcg gtg cgc att gaa Gcc ggc tat gaa ctg atg
--- --- --- +-- --- --- -+- --- --- --+ --- ---
 M   A   V   R   I   E   A   G   Y   E   L   M
atg gcg gtg cgc att gaa Gcc ggc ttt gaa ctg agg
--- --- --- +-- --- --- -+- --- --- --+ --- ---
 M   A   V   R   I   E   A   G   F   E   L   R
atg gcg gtg ctc att gaa Gcc ggc ttt gaa ctg agg
--- --- --- +-- --- --- -+- --- --- --+ --- ---
 M   A   V   L   I   E   A   G   F   E   L   R
atg gcg gtg cgc att gca Gcc ggc ttt gaa ctg agg
--- --- --- +-- --- --- -+- --- --- --+ --- ---
 M   A   V   R   I   A   A   G   F   E   L   R
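A minimal Python sketch of the translation step, to make the link between a point mutation and an amino acid change concrete. The codon table is only a partial excerpt of the standard genetic code covering the codons on this slide, and the name `translate` is an assumption made for illustration.

```python
# Partial excerpt of the standard genetic code: only the codons used on this slide.
CODON_TABLE = {
    "atg": "M", "gcg": "A", "gtg": "V", "cgc": "R", "att": "I",
    "gaa": "E", "acc": "T", "ggc": "G", "tat": "Y", "ctg": "L",
    "gcc": "A", "ttt": "F", "agg": "R", "ctc": "L", "gca": "A",
}


def translate(dna):
    """Translate space-separated codons into one-letter amino acid codes."""
    codons = dna.lower().split()  # lower() so a highlighted 'Gcc' is read as 'gcc'
    return "".join(CODON_TABLE[codon] for codon in codons)


ancestor = "atg gcg gtg cgc att gaa acc ggc tat gaa ctg atg"
mutant = "atg gcg gtg cgc att gaa Gcc ggc tat gaa ctg atg"  # acc -> Gcc

print(translate(ancestor))  # MAVRIETGYELM
print(translate(mutant))    # MAVRIEAGYELM
```

A single base change (acc to gcc) replaces T with A in the protein, which is the kind of step the slide traces.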

Evolution of a Sequence MAVLIESGFELR MAVRIEAGYELM

Homology Homology: features in different individuals that are descended from the same feature in a common ancestor (from Baxevanis, A. D. and Ouellette, B. F. F., Bioinformatics 3rd Ed., 2005). For example: human arm and bird wing are homologues. Analogous features: similar features in different individuals that arose independently through natural selection rather than from a common ancestor. For example: body shape of dolphins and tuna.

Homology Homologous sequences: Sequences that are similar because they are descended from a common ancestor. Homologous structures: Structures that are similar because they are descended from a common ancestor. May not have high degree of sequence similarity.

Homologs orthologs: Result of speciation or split of one species into two. Similar function in different species. paralogs: produced by gene duplication in an organism. One species can have multiple paralogs. Related, but not identical, function - e.g. different proteases. Many proteases in one species all break peptide bonds, but each has different target amino acids or proteins.

Origins of Homologs [Diagram: A -> gene duplication in one species -> A, B -> speciation (creation of new species) -> A1, B1 and A2, B2] A, A1, and A2 are orthologs. A and B are paralogs. A1 and B1 are paralogs. A and B2 are paralogs.

Sequence alignments Overview What positions in two sequences are equivalent Pairwise alignments two sequences aligned Multiple sequence alignment (MSA) Local and Global alignments

What information can we get from an alignment? Are genes or proteins similar? infer similar function Look for presence of functional residues active sites in enzymes is your protein functional?

Local and Global Alignments Local: finds the most similar regions of two sequences. Global: compares two sequences along their total length. Local alignments help align modular proteins or genes. Choose the method that gives the best alignment score.

Local and Global Alignments [Figure labels: sequences A and B; local: high scores; global: low score]

How is Alignment Made? VRETERI VRATERI VRETERI VRATERI VRETERI VEITGEIST?

How is Alignment Made? VRETERI VEITGEIST 1. Two sequences are aligned 2. A score is given to the alignment 3. All possible alignments are made and scores assigned 4. Alignment with the best possible score is selected.
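Trying every possible alignment one at a time quickly becomes impractical, so alignment programs typically use dynamic programming, which considers all alignments implicitly. Below is a minimal Python sketch of that idea, not the implementation of any particular tool. It assumes a simple scheme (match = 1, mismatch = 0, gap = -1, with end gaps penalized in global mode), and the name `best_alignment_score` is hypothetical. It returns only the best score, not the alignment itself.

```python
def best_alignment_score(a, b, match=1, mismatch=0, gap=-1, local=False):
    """Best score over all alignments of a and b, found by dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # Global mode: a prefix aligned against nothing costs one gap per residue.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if local:
                cell = max(cell, 0)          # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell

    # Global: score of aligning both full sequences. Local: best cell anywhere.
    return best if local else score[-1][-1]


print(best_alignment_score("XXXVRETERIXXX", "VRETERI"))              # global
print(best_alignment_score("XXXVRETERIXXX", "VRETERI", local=True))  # local
```

In this example the global score is pulled down by the gaps needed to span the flanking residues, while the local score reflects only the shared region.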

Scoring an Alignment Simple scoring system: Match = 1 Mismatch = 0 or -1 Gap = -1 Align and score these two sequences VRETERI VEITGEIST

Scoring an Alignment
Simple scoring system: Match = 1, Mismatch = 0 or -1, Gap = -1
VRETERI--
VEITGEIST
1 0 0 1 0 0 1 0 0 = 3 (mismatch scored as 0; end gaps not penalized)
Another alignment: more identities, but many more gaps. The score is low.
VRE-T-ERI--
V-EITGE-IST
1 -1 1 -1 1 -1 1 -1 1 0 0 = 1
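The scoring above can be sketched in a few lines of Python. This is an illustration only: the function name `score_alignment` is hypothetical, gaps are written as "-", and leaving end gaps unpenalized is an assumption inferred from the totals on this slide (3 and 1).

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1, free_end_gaps=True):
    """Score two gapped sequences of equal length, column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")

    columns = list(zip(top, bottom))
    if free_end_gaps:
        # Assumption: overhanging ends are not penalized, so drop those columns.
        while columns and "-" in columns[-1]:
            columns.pop()
        while columns and "-" in columns[0]:
            columns.pop(0)

    total = 0
    for x, y in columns:
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("VRETERI--", "VEITGEIST"))      # 3
print(score_alignment("VRE-T-ERI--", "V-EITGE-IST"))  # 1
```

The second alignment has five identities instead of three, but its four internal gaps cost more than the extra matches gain.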

Advanced Scoring Systems Scoring Matrices: PAM and BLOSUM Some substitutions conserve the function of a sequence position. Assign a different score to each substitution, not just 0 or 1. Assign separate scores for opening a gap and for extending a gap. Example (BLOSUM62): E aligned with E = 5; E aligned with D = 2; E aligned with I = -3

Advanced Scoring Systems Scoring Matrices: PAM and BLOSUM Factors taken into account with scoring matrices: Conservation of residue function. What residues can substitute for another: charge, size, hydrophobicity (see the amino acid properties chart). Frequency of residue in all proteins. Evolutionary patterns.

Choosing a scoring matrix
Matrix    Best use                                           % similarity
BLOSUM30  Long alignments of divergent sequences             <30%
BLOSUM62  Most effective at finding potential similarities   30-40%
BLOSUM80  Detecting members of a conserved protein family    50-60%
BLOSUM90  Short alignments of highly similar sequences       70-90%
PAM250    Long alignments of divergent sequences             <30%
PAM160    Detecting members of a conserved protein family    50-60%
PAM40     Short alignments of highly similar sequences       70-90%
Taken from Bioinformatics, A Practical Guide... 3rd Ed., Baxevanis, A.D. and Ouellette, B. F. F., 2005, pg. 303

FASTA format (file ending .fa or .fasta) Why use FASTA? Simple, portable, widely used, easy to edit. Each record starts with a header line: a > followed by the sequence information. The sequence starts on the following line (a line break after the header is required). Be careful with the format: make sure you have it all.
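A minimal Python sketch of reading this format, assuming plain text with one or more records. The function name `parse_fasta` and the record names in the example are made up for illustration; established libraries provide more robust parsers.

```python
def parse_fasta(text):
    """Return {header: sequence} for every record in a FASTA-formatted string."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()  # everything after '>' is sequence information
            records[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header line")
        else:
            records[header].append(line)  # a sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


example = """>example_query hypothetical record
KRPFIETAERLRDQ
HKKDYPEYKYQPRRR
>example_subject hypothetical record
KRPFVEGAERLREQHKKDHPEYKYQPRRR
"""

for name, sequence in parse_fasta(example).items():
    print(name, len(sequence))
```

The check for sequence data before any header is the "make sure you have it all" case: a file whose > line was lost when copying is rejected instead of silently misread.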

BLAST - Proteins Use BLASTP to find homologs: 1. Determine function of a novel protein. 2. Find members of gene or protein family. 3. Find a candidate protein structure.

How BLAST works
Given the sequence: KRPFIETAERLRDQHKKDYPEYKYQPRRR
BLAST searches the database for the three-letter query words starting at each letter of the sequence. The word size can be changed in the parameters section.
Examples of 3-letter words:
KRP  IET ... → search database
RPF  ETA ... → search database
PFI  TAE ... → search database
FIE  AER ... → search database
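Listing the query words is a sliding window over the sequence. A minimal Python sketch, with the hypothetical function name `query_words`:

```python
def query_words(sequence, word_size=3):
    """Return every overlapping word of length word_size, in order of position."""
    return [
        sequence[start:start + word_size]
        for start in range(len(sequence) - word_size + 1)
    ]


query = "KRPFIETAERLRDQHKKDYPEYKYQPRRR"
words = query_words(query)
print(words[:8])   # ['KRP', 'RPF', 'PFI', 'FIE', 'IET', 'ETA', 'TAE', 'AER']
print(len(words))  # 27
```

A sequence of length L yields L - 2 three-letter words, so this 29-residue query produces 27 words.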

How BLAST works The program searches for the exact match as well as three-letter words containing conservative substitutions. Scores for each three-letter word are determined by a scoring matrix (e.g. BLOSUM62). For example, when the query word is RDQ (from KRPFIETAERLRDQHKKDYPEYKYQPRRR), the score for the exact match is 16: R/R = 5, D/D = 6, Q/Q = 5 (look at your BLOSUM62 matrix). Note: Word size can be set in BLAST.

How BLAST works
Some possible three-letter words that match RDQ, with their scores:
RDQ 16   QDQ 12   NDQ 11   RDN 11   EDQ 11   SDQ 10
KDQ 13   REQ 12   HDQ 11   RDD 11   MDQ 10   RDP 10
RDE 13   RDR 12   RNQ 11   RDH 11   ADQ 10   RDT 10
Similar tables are generated for each three-letter word in the query. Each of these possible words is used to extend the alignment or neighborhood around the first alignment.
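A table like this can be generated by scoring every possible three-letter word against the query word and keeping those at or above a cutoff. The Python sketch below is an illustration, not BLAST's actual implementation: the function names are hypothetical, the cutoff of 11 is taken as an assumption, and only the three BLOSUM62 rows needed for RDQ are included. The rows were transcribed from the standard matrix, so check them against your own copy.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for R, D and Q only, in AMINO_ACIDS column order
# (transcribed by hand: verify against a reference copy of the matrix).
BLOSUM62_ROWS = {
    "R": [-1, 5, 0, -2, -3, 1, 0, -2, 0, -3, -2, 2, -1, -3, -2, -1, -1, -3, -2, -3],
    "D": [-2, -2, 1, 6, -3, 0, 2, -1, -1, -3, -4, -1, -3, -3, -1, 0, -1, -4, -3, -3],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
}


def word_score(query_word, candidate):
    """Sum the substitution scores position by position."""
    return sum(
        BLOSUM62_ROWS[q][AMINO_ACIDS.index(c)]
        for q, c in zip(query_word, candidate)
    )


def neighborhood(query_word, threshold=11):
    """All words scoring at least threshold against query_word, best first."""
    hits = {}
    for letters in product(AMINO_ACIDS, repeat=len(query_word)):
        candidate = "".join(letters)
        score = word_score(query_word, candidate)
        if score >= threshold:
            hits[candidate] = score
    return dict(sorted(hits.items(), key=lambda item: -item[1]))


for word, score in neighborhood("RDQ").items():
    print(word, score)
```

With a cutoff of 11, words such as SDQ, MDQ and ADQ (score 10) fall just outside the neighborhood.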

How BLAST works
A three-letter word that aligns with RDQ is found in a sequence in the database.
query     KRPFIETAERLRDQHKKDYPEYKYQPRRR
                     R+Q
database  KRPFVEGAERLREQHKKDHPEYKYQPRRR
Extension of the neighborhood:
query     KRPFIETAERLRDQHKKDYPEYKYQPRRR
                     R+Q
                       QHK
database  KRPFVEGAERLREQHKKDHPEYKYQPRRR
Neighborhood cutoff set to 11 (not adjustable in web version?)

How BLAST works
The alignment is then extended from this high-scoring segment pair (RDQ/REQ) in both directions and a score is calculated. Length of extension is determined by the score of the resulting alignment.
query    KRPFIETAERLRDQHKKDYPEYKYQPRRR
         KRPF+E AERLR+QHKKD+PEYKYQPRRR
subject  KRPFVEGAERLREQHKKDHPEYKYQPRRR
The alignment is called a high-scoring segment pair (HSP). These are possible alignments. More than one HSP can be generated for each query-subject pair.

How BLAST works
The alignment is then extended from this high-scoring segment pair (RDQ/REQ) in both directions and a score is calculated. Length of extension is determined by the score of the resulting alignment. Extension stops when the score can't be improved by including more sequence. The resulting alignment is called a high-scoring segment pair (HSP).
query    KRPFIETAERLRDQHKKDYPEYKYQPRRR
         KRPF+E AERLR+QHKKD+PEYKYQPRRR
subject  KRPFVEGAERLREQHKKDHPEYKYQPRRR
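The extension step can be sketched as an ungapped walk outward from the seed word that keeps the best-scoring stretch seen so far. This Python sketch rests on stated assumptions rather than BLAST's actual code: it scores identities only (match = 2, mismatch = -1) instead of using BLOSUM62, it stops once the running score falls more than `x_drop` below the best score seen (a drop-off rule chosen here for illustration), and all function names are hypothetical.

```python
def pair_score(a, b, match=2, mismatch=-1):
    """Assumed identity scoring; a real search would look this up in a matrix."""
    return match if a == b else mismatch


def extend_one_way(query, subject, qi, si, step, x_drop):
    """Walk away from the seed in one direction; return (best gain, residues kept)."""
    gain = best_gain = 0
    length = best_length = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        gain += pair_score(query[qi], subject[si])
        length += 1
        if gain > best_gain:
            best_gain, best_length = gain, length
        elif best_gain - gain > x_drop:
            break  # score has fallen too far below the best seen: stop extending
        qi += step
        si += step
    return best_gain, best_length


def extend_seed(query, subject, q_pos, s_pos, word_size=3, x_drop=5):
    """Extend a seed word hit in both directions; return (score, query part, subject part)."""
    seed = sum(pair_score(query[q_pos + k], subject[s_pos + k]) for k in range(word_size))
    right_gain, right_len = extend_one_way(
        query, subject, q_pos + word_size, s_pos + word_size, 1, x_drop)
    left_gain, left_len = extend_one_way(
        query, subject, q_pos - 1, s_pos - 1, -1, x_drop)
    q_from, s_from = q_pos - left_len, s_pos - left_len
    span = left_len + word_size + right_len
    return (seed + left_gain + right_gain,
            query[q_from:q_from + span],
            subject[s_from:s_from + span])


query = "KRPFIETAERLRDQHKKDYPEYKYQPRRR"
subject = "KRPFVEGAERLREQHKKDHPEYKYQPRRR"
seed_at = query.index("RDQ")  # the REQ hit sits at the same offset in the subject
print(extend_seed(query, subject, seed_at, seed_at))
```

For these two sequences the few mismatches never pull the running score far enough down, so the extension covers the full length of both.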

How BLAST works The score of the HSP is used to calculate the E-value. The HSP is reported as a hit if the E-value is below the specified cutoff.

BLAST results What are you looking for? What did you get? Homologs: orthologs or paralogs. Conserved domains in a new protein architecture (paralog?)

BLAST results If you have a lot of sequence in your database, you might expect to find any sequence pattern if you look hard enough. The likelihood of this happening is greater if the database is large and your query is short. Think about searching the nr database with a five or ten amino acid segment. ELVIS in the genome!
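A back-of-the-envelope Python sketch of why short queries match by chance. It assumes all 20 amino acids are equally frequent and independent of their neighbors, which real proteins are not, and it treats the database as one long string. The function name and the database size in the example are hypothetical.

```python
def expected_chance_hits(pattern_length, database_residues, alphabet_size=20):
    """Expected number of exact occurrences of one fixed pattern in random sequence."""
    positions = max(database_residues - pattern_length + 1, 0)  # possible start sites
    return positions / alphabet_size ** pattern_length


database_size = 100_000_000  # hypothetical database of 10^8 residues
print(expected_chance_hits(5, database_size))   # a 5-mer such as ELVIS
print(expected_chance_hits(10, database_size))  # a 10-mer
```

Under these assumptions a 5-residue pattern is expected roughly 30 times in such a database by chance alone, while a 10-residue pattern is expected far less than once.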

BLAST results 1. E-value: the number of matches with this score or better expected to occur by chance. Calculated from the size of the database, the length of the query and the score of the HSP. No hard and fast rules for E-value: lower E-values indicate more significant hits. An identical match to a long query gives an E-value close to 0. As a rule of thumb, the E-value should be below 0.0001.
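The dependence on database size, query length and score can be made concrete with the standard relation E = m × n × 2^(-S'), where m is the query length, n is the database length and S' is the bit score (a normalized alignment score). The Python sketch below applies that relation directly. It is an approximation for illustration: BLAST itself uses corrected ("effective") lengths, so reported values will differ somewhat, and the function name and example numbers are hypothetical.

```python
def e_value_from_bit_score(bit_score, query_length, database_length):
    """Approximate E-value: search space size times 2^(-bit score)."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical numbers: a 29-residue query against 10^8 database residues.
for bits in (30, 40, 50):
    print(bits, e_value_from_bit_score(bits, 29, 100_000_000))
```

Each additional bit of score halves the E-value, and doubling the database doubles it, which is why the same alignment looks less significant in a larger database.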

BLAST results 2. Bit score: measure of the significance of the alignment. 3. Percent identity. 4. Length of alignment: the longer the better. 5. Beyond BLAST results: is the hit a homolog?
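Percent identity and alignment length can be read straight off an aligned pair. A minimal Python sketch, assuming gaps are written as "-" and that identity is reported relative to the alignment length (other denominators are also in use); the function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Identical columns as a percentage of the alignment length."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("need two non-empty aligned sequences of equal length")
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * identical / len(aligned_a)


query = "KRPFIETAERLRDQHKKDYPEYKYQPRRR"
subject = "KRPFVEGAERLREQHKKDHPEYKYQPRRR"
print(len(query), round(percent_identity(query, subject), 1))  # 29 86.2
```

Here 25 of the 29 aligned positions are identical.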