Alignment & BLAST. By: Hadi Mozafari KUMS


SIMILARITY - ALIGNMENT
Comparison of primary DNA or protein sequences to other primary or secondary sequences, expecting that the function of the similar sequence is known from experiments. This is thinking by analogy: we assume that if the sequence is similar, the function is also similar.
Questions a similarity search can answer: Is the sequence contaminated with vector sequences? Is it an already known gene? Is it related to other genes by having a common ancestor? Is it similar in function to other genes via convergent evolution? What could the protein sequence be if this nucleotide fragment is translated, and what might that protein be like?

Similarity

Limits of Similarity

Significance of Alignment

Alignment Most alignment programs create an alignment that represents what happened during evolution at the DNA level. To carry over information from a well studied to a newly determined sequence, we need an alignment that represents the protein structures of today.

Sequence Alignment In phylogeny one wants to line up residues that came from a common ancestor. For information transfer one wants to line up residues at similar positions in the structure. gap = insertion or deletion

Global versus Local Alignment Global Local

An Example for Proteins
RBP             1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50
lactoglobulin   1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44
RBP            51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97
lactoglobulin  45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93
RBP            98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...QYSC 136
lactoglobulin  94 IPAVFKIDALNENKVL...VLDTDYKKYLLFCMENSAEPEQSLAC 135
RBP           137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185
lactoglobulin 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI... 178

Pairwise alignment of retinol-binding protein and β-lactoglobulin: in the match line between the two sequences, a bar marks an identity.

Pairwise alignment of retinol-binding protein and β-lactoglobulin: in the match line, one dot marks a somewhat similar pair of residues and two dots mark a very similar pair.

Pairwise alignment of retinol-binding protein and β-lactoglobulin: a gap inside the alignment is an internal gap; a gap at either end of a sequence is a terminal gap.

Global Alignment Align two sequences from head to toe, i.e. from 5' ends to 3' ends (DNA) or from N-termini to C-termini (protein). Algorithm published by: Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48:443-453.

Global Alignment
We fill up this matrix backwards, using a very simple scoring scheme. Identity = 1. Other = 0. Gaps cost -1.
      a   a   c   t   t   g   a   g   c   -
  c                                      -6
  t                                      -5
  g                                      -4
  a                                      -3
  g                                      -2
  t                                      -1
  -  -9  -8  -7  -6  -5  -4  -3  -2  -1   0

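Global Alignment
Score of a cell = score of the cell you came from + gap penalty (if the step introduces a gap) + similarity score (if the step pairs two residues). Each cell keeps the best of the three possible steps. The border is the same as before: the last row runs from -9 to 0 and the last column from -6 to 0.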
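The fill rule above can be sketched in a few lines of Python. This is only an illustration under the slide's scoring scheme (identity = 1, other = 0, gap = -1); the function name `global_align` and the choice to return one arbitrary optimal alignment are assumptions made for the example, not part of the original slides.

```python
def global_align(cols, rows, match=1, mismatch=0, gap=-1):
    """Needleman-Wunsch, filled backwards from the bottom-right corner.

    F[i][j] is the best score for aligning rows[i:] with cols[j:].
    """
    n, m = len(rows), len(cols)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):          # last column: -1, -2, ... going up
        F[i][m] = F[i + 1][m] + gap
    for j in range(m - 1, -1, -1):          # last row: -1, -2, ... going left
        F[n][j] = F[n][j + 1] + gap
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            sim = match if rows[i] == cols[j] else mismatch
            F[i][j] = max(F[i + 1][j + 1] + sim,   # pair the two residues
                          F[i + 1][j] + gap,       # gap in the column sequence
                          F[i][j + 1] + gap)       # gap in the row sequence
    # Walk forward from the top-left corner to recover one optimal alignment.
    top, bottom, i, j = [], [], 0, 0
    while i < n or j < m:
        if i < n and j < m:
            sim = match if rows[i] == cols[j] else mismatch
            if F[i][j] == F[i + 1][j + 1] + sim:
                top.append(cols[j]); bottom.append(rows[i])
                i += 1; j += 1
                continue
        if i < n and F[i][j] == F[i + 1][j] + gap:
            top.append("-"); bottom.append(rows[i]); i += 1
        else:
            top.append(cols[j]); bottom.append("-"); j += 1
    return F, "".join(top), "".join(bottom)


F, top, bottom = global_align("aacttgagc", "ctgagt")
print(F[0][0])   # optimal global score: 2
print(top)
print(bottom)
```

The last row and last column of `F` reproduce the border values shown on the slide. Several alignments can share the optimal score, so the printed alignment is one of possibly many.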

Local Alignment Locate region(s) with high degree of similarity in two sequences Algorithm published by: Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol. 147:195-197.
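A minimal sketch of the Smith-Waterman recurrence follows. It differs from the global method in one point: a cell score is never allowed to drop below zero, so an alignment can start and end anywhere. The scoring values (match = 1, mismatch = -1, gap = -1) and the function name `local_align_score` are assumptions chosen for illustration; the slides do not give a scoring scheme for the local case.

```python
def local_align_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best local alignment score and the cell where it ends."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best, best_end = 0, (0, 0)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sim = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                        # start a new local alignment
                          H[i - 1][j - 1] + sim,    # pair the two residues
                          H[i - 1][j] + gap,        # gap in b
                          H[i][j - 1] + gap)        # gap in a
            if H[i][j] > best:
                best, best_end = H[i][j], (i, j)
    return best, best_end


score, end = local_align_score("aacttgagc", "ctgagt")
print(score)   # 4, from the shared region "tgag"
```

A negative mismatch score matters here: with mismatch = 0 the local score could never decrease along a diagonal and the "local" region would tend to grow to the full sequence.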

Alignment methods Rigorous algorithms: Needleman-Wunsch, Smith-Waterman. Heuristic algorithms: BLAST, FASTA.

The Needleman-Wunsch algorithm, published in 1970, finds the optimal global alignment of two sequences by maximizing the number of amino acid matches and minimizing the number of gaps needed to align them. The Smith-Waterman algorithm, published in 1981, is very similar, but it is a local sequence alignment algorithm: instead of aligning the entire length of two protein sequences, it finds the region of highest similarity between them.
Heuristic Methods Thus far, we have discussed optimal sequence alignment methods, which find the highest scoring alignment for any pair of protein sequences. However, these algorithms are often too slow to search an entire database in reasonable time. Heuristic, or approximate, algorithms like FASTA and BLAST were therefore developed to speed up the process while keeping as much sensitivity as possible.
BLAST The BLAST (Basic Local Alignment Search Tool) algorithm was developed by Altschul et al. in 1990 and, like FASTA, is a heuristic pairwise sequence aligner. The query protein sequence is split into short words. Using a substitution matrix, a list of related words, called a neighborhood, is created for each word found in the query; a word is included only if its substitution matrix score against the original word is higher than a threshold T. For fast access to these data, the word positions are entered into a hash table.
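The neighborhood and hash table idea can be sketched as follows. Everything specific here is an assumption for illustration: the word length of 3, the threshold T = 11, and above all `toy_score`, which is a made-up three-level scoring function standing in for a real substitution matrix such as BLOSUM62. It is not how BLAST itself is implemented, only the same idea at small scale.

```python
from collections import defaultdict
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
# Toy similarity groups (assumption): NOT a real substitution matrix.
GROUPS = ["ILMV", "FWY", "KRH", "DE", "ST", "NQ"]


def toy_score(x, y):
    """5 for identity, 2 within a toy group, -2 otherwise."""
    if x == y:
        return 5
    if any(x in g and y in g for g in GROUPS):
        return 2
    return -2


def neighborhood(word, T):
    """All words of the same length scoring higher than T against `word`."""
    return ["".join(cand) for cand in product(ALPHABET, repeat=len(word))
            if sum(toy_score(a, b) for a, b in zip(word, cand)) > T]


def build_index(query, w=3, T=11):
    """Hash table: neighborhood word -> positions of the query word it came from."""
    index = defaultdict(list)
    for pos in range(len(query) - w + 1):
        for nb in neighborhood(query[pos:pos + w], T):
            index[nb].append(pos)
    return index


def find_seeds(index, subject, w=3):
    """Look up every subject word in the table; hits are (query_pos, subject_pos)."""
    return [(qpos, spos)
            for spos in range(len(subject) - w + 1)
            for qpos in index.get(subject[spos:spos + w], [])]


index = build_index("AMILK")
print(find_seeds(index, "GGMVLGG"))   # [(1, 2)]: MVL is a neighbor of MIL
```

Each seed found this way would then be extended into a longer local alignment; the extension step is left out of the sketch.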

Pairwise comparison Local alignment: identify the most similar region shared between two sequences (Smith-Waterman). Global alignment: align over the length of both sequences (Needleman-Wunsch).

Global local alignment
Sequences: TEGNAP VELED VOLTAM and TEGNAP VELED MAGOLTAM VELE DALOLTAM
Global:
  TEGNAP VELED MAGOLTAM VELE DALOLTAM
  ::::::::::::          :       :::::
  TEGNAP VELED----------V-------OLTAM
  TEGNAP VELED MAGOLTAM VELE DALOLTAM
  ::::::::::::   .:::::
  TEGNAP-VELED---VOLTAM--------------
  TEGNAP VELED MAGOLTAM VELE DALOLTAM
  ::::::::::::                 .:::::
  TEGNAP VELED ----------------VOLTAM
  TEGNAP VELED MAGOLTAM VELE DALOLTAM
  ::::::                :::: : .:::::
  TEGNAP----------------VELE-D-VOLTAM
Local:
  TEGNAP VELED MAGOLTAM
  ::::::::::::   .:::::
  TEGNAP VELED---VOLTAM
  TEGNAP VELED
  :::::: :::::
  TEGNAP VELED
  VELE DALOLTAM
  :::: : .:::::
  VELE-D-VOLTAM

Multiple Sequence Alignment (MSA) and Trees Take, for example, the three sequences:
1 ASWTFGHK
2 GTWSFANR
3 ATWAFADR
and you see immediately that 2 and 3 are close, while 1 is further away. So the tree will look roughly like ((3,2),1): sequences 3 and 2 join first, and 1 branches off on its own.
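The "you see immediately" step can be checked by counting identical positions for each pair. This is a small sketch that assumes the three sequences are already aligned column by column, as they are on the slide; the names `seqs`, `identities` and `closest` are made up for the example.

```python
from itertools import combinations

seqs = {1: "ASWTFGHK", 2: "GTWSFANR", 3: "ATWAFADR"}


def identities(a, b):
    """Number of columns where two equally long, pre-aligned sequences agree."""
    return sum(x == y for x, y in zip(a, b))


pairs = {(i, j): identities(seqs[i], seqs[j])
         for i, j in combinations(sorted(seqs), 2)}
closest = max(pairs, key=pairs.get)

print(pairs)     # {(1, 2): 2, (1, 3): 3, (2, 3): 5}
print(closest)   # (2, 3): these two are joined first in the tree
```

Joining the closest pair first and then attaching the remaining sequence is the simplest form of distance-based tree building; real MSA programs use more careful distance estimates.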

Useful Terms and Phrases in Sequence Alignment

Scoring Matrix/Substitution Matrix Used to score the quality of an alignment. Contains scores for pairs of residues (amino acids or nucleic acids) in a sequence alignment. For protein/protein comparisons: a symmetric 20 x 20 matrix of similarity scores, where identical amino acids and those of similar character (e.g. Ile, Leu) give higher scores than those of different character (e.g. Ile, Asp).

Protein Scoring Systems Amino acids have different biochemical and physical properties that influence their relative replaceability in evolution. [Figure: Venn diagram grouping the amino acids by property: tiny, small, aliphatic, aromatic, hydrophobic, polar, charged, positive; cysteine is shown in both its free (SH) and disulfide (S-S) forms.]

Protein Scoring Systems Scoring matrices reflect the probabilities of mutual substitutions and the probability of occurrence of each amino acid. Widely used scoring matrices: PAM, BLOSUM.

DNA Scoring Systems
Sequence 1: actaccagttcatttgatacttctcaaa
Sequence 2: taccattaccgtgttaactgaaaggacttaaagact
Negative scoring values penalize mismatches:
      A   T   C   G
  A   5  -4  -4  -4
  T  -4   5  -4  -4
  C  -4  -4   5  -4
  G  -4  -4  -4   5
Matches: 5. Mismatches: 19. Score: 5 x 5 + 19 x (-4) = -51

Dotplots
Scoring matrix:
      A   T   G   C
  A   5  -4  -4  -4
  T  -4   5  -4  -4
  G  -4  -4   5  -4
  C  -4  -4  -4   5
CCTCCTTTGT against CCTCCTTTGT (Pro, Leu): 5 5 5 5 5 5 5 5 5 5, Point = 50
CCTCCTTTGG against CCTCCCTTAG (Pro, Leu in both): 5 5 5 5 5 -4 5 5 -4 5, Point = 32
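The point values on this slide and the -51 on the DNA scoring slide all come from the same rule: +5 for every matching position and -4 for every mismatch, with no gaps. A short sketch, where the function name `dna_score` is an assumption for illustration:

```python
def dna_score(a, b, match=5, mismatch=-4):
    """Ungapped score of two equally long DNA sequences, position by position."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


print(dna_score("CCTCCTTTGT", "CCTCCTTTGT"))   # 50: ten matches
print(dna_score("CCTCCTTTGG", "CCTCCCTTAG"))   # 32: eight matches, two mismatches
print(5 * 5 + 19 * (-4))                       # -51: five matches, nineteen mismatches
```

The second pair shows why DNA scores can understate relatedness: the two sequences differ at two positions, yet the slide labels both with the same amino acids.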

Substitution Matrices Not all amino acids are equal. Residues mutate more easily to similar ones. Residues at the surface mutate more easily. Aromatics mutate preferably into aromatics. Mutations tend to favor some substitutions. The core tends to be hydrophobic. Selection tends to favor some substitutions. Cysteines are dangerous at the surface. Cysteines in bridges seldom mutate.

Which Matrix to use?
Close relationships: low PAM, high BLOSUM. Distant relationships: high PAM, low BLOSUM.
            More conserved                More variable
  BLOSUM    BLOSUM 80      BLOSUM 62      BLOSUM 45
  PAM       PAM 20         PAM 120        PAM 250
Often used defaults are: PAM250, BLOSUM62

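BLOSUM62 Substitution Matrix Zero: the substitution occurs as often as expected by chance. Positive (+): more often than chance. Negative (-): less often than chance. The matrix is arranged by side groups, so high scores cluster in the end boxes. Example: M, I, L and V are interchangeable.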

PAM250 Matrix

Scoring example The score of an alignment is the sum of the scores of all pairs of residues in the alignment.
sequence 1:  T  C  C  P  S  I  V  A  R  S  N
sequence 2:  S  C  C  P  S  I  S  A  R  N  T
score:       1 12 12  6  2  5 -1  2  6  1  0
=> alignment score = 46
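The sum can be reproduced with a small lookup table. The table below contains only the residue pairs that occur in this example, with the values printed on the slide; it is a fragment for illustration, not a complete substitution matrix, and the names `PAIR_SCORES`, `pair_score` and `alignment_score` are assumptions.

```python
# Pair scores exactly as printed on the slide (only the pairs this example needs).
PAIR_SCORES = {
    ("T", "S"): 1, ("C", "C"): 12, ("P", "P"): 6, ("S", "S"): 2,
    ("I", "I"): 5, ("V", "S"): -1, ("A", "A"): 2, ("R", "R"): 6,
    ("S", "N"): 1, ("N", "T"): 0,
}


def pair_score(x, y):
    """Substitution matrices are symmetric, so look the pair up in either order."""
    if (x, y) in PAIR_SCORES:
        return PAIR_SCORES[(x, y)]
    return PAIR_SCORES[(y, x)]


def alignment_score(a, b):
    """Sum of pair scores over an ungapped alignment of equal-length sequences."""
    return sum(pair_score(x, y) for x, y in zip(a, b))


print(alignment_score("TCCPSIVARSN", "SCCPSISARNT"))   # 46
```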

BLAST Question: What database sequences are most similar to (or contain the most similar regions to) my previously uncharacterised sequence? BLAST finds the highest scoring locally optimal alignments between a query sequence and a database. It compares new genes to known ones from different species or hosts and suggests possible functions based on similarities to known sequences. Very fast algorithm. Can be used to search extremely large databases. Sufficiently sensitive and selective for most purposes. Robust: the default parameters can usually be used.

BLAST is like using Google for DNA sequences

BLAST Algorithm Step 1: Read/understand user query sequence. Step 2: Use hashing technology to select several thousand likely candidates. Step 3: Do a real alignment between the query sequence and those likely candidates. Real alignment is a main topic of this course. Step 4: Present output to user.

Steps in running BLAST: BLAST Input
1. Enter your query sequence (cut-and-paste)
2. Select the database(s) you want to search
3. Choose output parameters
4. Choose alignment parameters (scoring matrix, filters, ...)
Example query=
MAFIWLLSCYALLGTTFGCGVNAIHPVLTGLSKIVNGEEAVPGTWPWQVTLQDRSGFHFC
GGSLISEDWVVTAAHCGVRTSEILIAGEFDQGSDEDNIQVLRIAKVFKQPKYSILTVNND
ITLLKLASPARYSQTISAVCLPSVDDDAGSLCATTGWGRTKYNANKSPDKLERAALPLLT
NAECKRSWGRRLTDVMICGAASGVSSCMGDSGGPLVCQKDGAYTLVAIVSWASDTCSASS
GGVYAKVTKIIPWVQKILSSN

Alignment Significance in BLAST P-value (probability): relates the score of an alignment to the likelihood that it arose by chance. The closer to zero, the greater the confidence that the hit is real. E-value (expect value): the number of alignments with a score at least as high that would be expected by chance in that database (e.g. if E=10, 10 matches with scores this high are expected to be found by chance). A match is reported if its E is below the threshold. Lower E thresholds are more stringent and report fewer matches.
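The slides do not state how E and P are computed. A common textbook form, assumed here, is the Karlin-Altschul relation for normalized bit scores, E = m x n x 2^(-S'), where S' is the bit score, m the query length and n the total database length, together with P = 1 - e^(-E). The sketch below uses that assumption; the function names and the example numbers are hypothetical.

```python
import math


def expect_value(bit_score, query_len, db_len):
    """E = m * n * 2**(-S'), assuming S' is a normalized bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(e_value):
    """P = 1 - exp(-E): probability of at least one chance hit this good."""
    return -math.expm1(-e_value)


# Hypothetical search: a 250-residue query against 10 million database residues.
for bits in (30, 40, 50):
    e = expect_value(bits, 250, 10_000_000)
    print(bits, e, p_value(e))
```

Two properties are worth noticing: every extra bit of score halves E, and for small E the two values are nearly identical, which is why BLAST output is usually read in terms of E alone.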

BLAST Types

BLAST programs
  Program   Input     Database   Frame comparisons
  blastn    DNA       DNA        1
  blastp    protein   protein    1
  blastx    DNA       protein    6
  tblastn   protein   DNA        6
  tblastx   DNA       DNA        36
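The counts 6 and 36 come from translation: a DNA sequence can be read in three frames on each strand, so blastx and tblastn compare six translations against protein, and tblastx compares six against six. A minimal sketch of six-frame translation, assuming the standard genetic code and plain A/C/G/T input; the function names are made up for the example.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def reverse_complement(dna):
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Three reading frames on the given strand, then three on the reverse strand."""
    rc = reverse_complement(dna)
    return [translate(strand[offset:]) for strand in (dna, rc) for offset in range(3)]


print(translate("CCTCCTTTGT"))    # PPL
print(translate("CCTCCCTTAG"))    # PPL: different DNA, same protein
print(len(six_frames("CCTCCTTTGT")))   # 6
```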

Guide to BLAST Results

Database Searching Overview A query sequence Q and a database of sequences go into a comparison algorithm, which returns a list of similar protein sequences. From this list we infer homologues and similar structures.

Search with Protein not DNA 1) 4 DNA bases vs. 20 amino acids: less random similarity. 2) Can have varying degrees of similarity between different amino acids. 3) Protein databanks are much smaller than DNA databanks.

Pairwise alignment: protein sequences can be more informative than DNA
Many times, DNA alignments are appropriate:
--to confirm the identity of a cDNA
--to study noncoding regions of DNA
--to study DNA polymorphisms
--example: Neanderthal vs modern human DNA
Query: 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240
Sbjct: 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247

BLAST Databases

BLAST output

Graphic Display Colors

Output Parts

Practical: Go to BLAST in NCBI

Select BLAST Type

blastn

Results Page: Graphic view

Alignments

Target Sequence in NCBI

blastx Parameters

tblastx Parameters