Basics on bioinformatics. Lecture 7. Nunzio D'Agostino

Nunzio D'Agostino: nunzio.dagostino@entecra.it; nunzio.dagostino@gmail.com

Multiple alignments
One sequence plays coy; a pair of homologous sequences whispers; many aligned sequences shout out loud.

Multiple alignments
Multiple nucleotide or amino acid sequence alignments are usually performed for one of the following purposes:
o characterizing protein families by identifying shared regions of similarity in a multiple sequence alignment;
o determining the consensus sequence of several aligned sequences;
o helping to predict the secondary and tertiary structures of new sequences;
o providing a preliminary step in molecular evolution analysis, where phylogenetic methods are used to construct phylogenetic trees.

Multiple alignment programs
Adapted from Current Opinion in Structural Biology 2006, 16:368-373.

ClustalW
ClustalW is a general-purpose multiple alignment program for DNA or proteins.
The W stands for weighted: different parts of the alignment are weighted differently.
The most practical and widely used approach to multiple sequence alignment is a hierarchical extension of pairwise alignment methods. The principle is that a multiple alignment is achieved by successive application of pairwise methods.
The three basic steps of the ClustalW approach are shared by all progressive alignment algorithms:
o Calculate a matrix of pairwise distances based on pairwise alignments between the sequences
o Use this distance matrix to build a guide tree, which is an inferred phylogeny for the sequences
o Use the guide tree to direct the progressive alignment of the sequences

clustalw: calculate pairwise distance
Aligns each sequence against every other sequence, giving a similarity matrix.
Similarity = exact matches / sequence length (percent identity)

      s1    s2    s3    s4
s1    -
s2   .17    -
s3   .87   .28    -
s4   .59   .33   .62    -

(.87 means 87% identical)
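The similarity measure above can be sketched in a few lines of Python. This is a minimal illustration, not ClustalW's own code: it assumes the sequences are already aligned to equal length, uses the aligned length as the denominator, and never counts a gap as a match. The function names and the toy sequences are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Fraction of aligned columns in which both sequences carry the same residue.

    Assumptions (not from the slides): a and b are already aligned,
    the denominator is the aligned length, and a gap never counts as a match.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / len(a)


def identity_matrix(aligned: dict) -> dict:
    """Lower-triangular similarity matrix keyed by (row name, column name)."""
    names = list(aligned)
    return {
        (row, col): percent_identity(aligned[row], aligned[col])
        for i, row in enumerate(names)
        for col in names[:i]
    }


if __name__ == "__main__":
    toy = {"x": "ACGT", "y": "ACGA", "z": "TCGA"}  # hypothetical toy data
    for pair, value in identity_matrix(toy).items():
        print(pair, round(value, 2))
```

In a real run each matrix entry would come from its own pairwise alignment, so the aligned length can differ from pair to pair.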

clustalw: create guide tree
Create a guide tree using the similarity matrix:
o ClustalW uses the neighbor-joining method
o The guide tree roughly reflects evolutionary relations
Leaf order in the example guide tree: s1, s3, s4, s2
Calculate:
s1,3 = alignment(s1, s3)
s1,3,4 = alignment((s1,3), s4)
s1,2,3,4 = alignment((s1,3,4), s2)
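To see how a join order like the one above can fall out of the similarity matrix, here is a small Python sketch. Note the substitution: ClustalW uses neighbor joining, while this sketch uses plain average-linkage clustering (UPGMA-style) because it is shorter to write. The names `SIM`, `cluster_similarity` and `join_order` are hypothetical; the numbers are the ones from the example matrix.

```python
from itertools import combinations

# Pairwise similarities taken from the example matrix (s1..s4).
SIM = {
    frozenset(("s1", "s2")): 0.17,
    frozenset(("s1", "s3")): 0.87,
    frozenset(("s2", "s3")): 0.28,
    frozenset(("s1", "s4")): 0.59,
    frozenset(("s2", "s4")): 0.33,
    frozenset(("s3", "s4")): 0.62,
}


def cluster_similarity(c1, c2, sim):
    """Average similarity over all leaf pairs between two clusters."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(sim[frozenset(p)] for p in pairs) / len(pairs)


def join_order(names, sim):
    """Repeatedly merge the two most similar clusters; return the merges in order.

    Simplified average linkage, NOT the neighbor-joining method ClustalW uses.
    """
    clusters = [(n,) for n in names]
    order = []
    while len(clusters) > 1:
        c1, c2 = max(
            combinations(clusters, 2),
            key=lambda pair: cluster_similarity(pair[0], pair[1], sim),
        )
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        order.append((c1, c2))
    return order


if __name__ == "__main__":
    for left, right in join_order(["s1", "s2", "s3", "s4"], SIM):
        print("align", left, "with", right)
```

On this matrix the sketch joins s1 with s3 first, then adds s4, then s2, which matches the order listed on the slide. On other data, neighbor joining and average linkage can produce different trees.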

clustalw: progressive alignment
Start by aligning the two most similar sequences.
Following the guide tree, add in the next sequences, aligning to the existing alignment.
Insert gaps as necessary.

s1  PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
s3  PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
s4  SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
s2  PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
    .. : **. :.. *:.* *. * **:

clustalw: progressive alignment
Works by progressive alignment: it aligns a pair of sequences, then aligns the next one onto the first pair.
The most closely related sequences are aligned first; additional sequences and groups of sequences are then added, guided by the initial alignments.
Uses alignment scores to produce a phylogenetic tree.
Aligns the sequences consecutively, guided by the phylogenetic relationships indicated by the tree.
Gap penalties can be adjusted based on specific amino acid residues, regions of hydrophobicity, proximity to other gaps, or secondary structure.
Is available with a great web interface: http://www.ebi.ac.uk/clustalw/
Also available as ClustalX (stand-alone MS-Windows software).

MSA: multiple sequence alignment
A Multiple Sequence Alignment of a set of biosequences is a rectangular arrangement in which each row consists of one sequence padded by gaps, such that the columns highlight similarity/conservation between positions.
* indicates positions that have the same amino acid in all sequences
: indicates positions with amino acids with strongly similar properties in all sequences (i.e. score > 0.5 in the PAM 250 matrix)
. indicates positions with amino acids with weakly similar properties in all sequences
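As a rough illustration of how such a conservation line is derived, the Python sketch below marks only the fully identical columns with `*`. The `:` and `.` symbols are deliberately left out, because they require substitution-matrix scores that are not reproduced here. The function name `identity_line` and the toy rows are hypothetical.

```python
def identity_line(rows):
    """Return a string with '*' under every column that is identical in all rows.

    Only the '*' symbol is computed; ':' and '.' would need
    substitution-matrix scores, which this sketch does not include.
    Columns containing a gap are never marked.
    """
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    marks = []
    for column in zip(*rows):
        residues = set(column)
        marks.append("*" if len(residues) == 1 and "-" not in residues else " ")
    return "".join(marks)


if __name__ == "__main__":
    rows = ["PEEM-A", "PEEL-A", "SEEMTA"]  # hypothetical toy alignment
    for row in rows:
        print(row)
    print(identity_line(rows))
```

Because every row has the same length, the marker string can be printed directly under the alignment block.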

Colouring the alignment according to: conservation, identity percentage, hydrophobicity, user-defined criteria

Colouring the alignment, user-defined example: aromatic amino acids

Sequence logo
A sequence logo is a graphical representation of an amino acid or nucleic acid multiple sequence alignment. Each logo consists of stacks of symbols, one stack for each position in the sequence. The overall height of the stack indicates the sequence conservation at that position, while the height of symbols within the stack indicates the relative frequency of each amino or nucleic acid at that position. In general, a sequence logo provides a richer and more precise description of, for example, a binding site, than would a consensus sequence. WebLogo is a web-based application designed to make the generation of sequence logos easy and painless.
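The stack heights described above can be sketched numerically. The usual measure is the information content of a column in bits, log2(alphabet size) minus the Shannon entropy of the observed frequencies, and each symbol gets a share of that height proportional to its frequency. The Python below is a simplified sketch: it ignores gaps and omits the small-sample correction that logo tools typically apply. The function name `column_information` is hypothetical.

```python
import math
from collections import Counter


def column_information(column: str, alphabet_size: int):
    """Return (stack height in bits, per-symbol heights) for one alignment column.

    Simplified: gaps are ignored and no small-sample correction is applied.
    alphabet_size is 4 for nucleotides and 20 for amino acids.
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0, {}
    counts = Counter(residues)
    total = len(residues)
    freqs = {r: n / total for r, n in counts.items()}
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    info = math.log2(alphabet_size) - entropy
    heights = {r: f * info for r, f in freqs.items()}
    return info, heights


if __name__ == "__main__":
    for col in ("AAAA", "AACC", "ACGT"):  # hypothetical DNA columns
        info, heights = column_information(col, alphabet_size=4)
        print(col, round(info, 2), heights)
```

A perfectly conserved DNA column reaches the maximum of 2 bits, while a column with all four bases at equal frequency has height 0, which is why conserved positions stand out in a logo.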

Weblogo