Multiple Alignment using Hydrophobic Clusters: a tool to align and identify distantly related proteins
J. Baussand, C. Deremble, A. Carbone
Analytical Genomics, Laboratoire d'Immuno-Biologie Cellulaire et Moléculaire des infections parasitaires, INSERM U511
91, boulevard de l'Hôpital, 75013 Paris, France
Alignment and Homology Search Reliability (1)
A) Structure: good reliability
B) Sequence: MFPDAHCELVHRNPFELLIAVVLSAQ MFPDAHCEL -VHRNPFELLIAVVLSAQ PGLRLPGC-----VDAFEQGVRAILGQL YHDNEWGVPETDSKKLFEMICLEGQQAG
E.E. Hill and S.E. Brenner, IPAM Structural Proteomics, 2004
Alignment and Homology Detection Reliability (2)
Introduction of structural information in:
1) sequences
2) alignment parameters:
- substitution matrices (Overington et al., 1992; Teodorescu et al., 2004)
- gap penalties (Lesk et al., 1986)
Hydrophobic Clusters
- Hydrophobic core: folding stability
- Surface: protein interactions
Prediction of Hydrophobic Clusters in Sequences
- HCA: α-helical 2D representation, manual alignment (Gaboriaud et al., 1987)
- 1D representation (N-ter to C-ter), automatic alignment (Baussand, Deremble and Carbone, in preparation)
Specific periodicity: +/- 1, (2,) 3, 4
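The 1D delineation of hydrophobic clusters can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the classical HCA conventions (hydrophobic alphabet V, I, L, F, M, Y, W; a cluster is broken by a run of at least four non-hydrophobic residues or by a proline), none of which is stated on the slide. The names `hydrophobic_clusters`, `min_gap` and `breaker` are hypothetical.

```python
HYDROPHOBIC = set("VILFMYW")  # assumed HCA hydrophobic alphabet


def hydrophobic_clusters(seq, min_gap=4, breaker="P"):
    """Return (start, end) pairs (0-based, inclusive) of hydrophobic clusters.

    Two hydrophobic residues fall in the same cluster when fewer than
    `min_gap` non-hydrophobic residues separate them and no `breaker`
    residue lies between them (both rules are assumptions).
    """
    clusters = []
    start = last = None  # bounds of the cluster being built
    for i, aa in enumerate(seq.upper()):
        if aa == breaker and start is not None:
            # A breaker residue closes the current cluster.
            clusters.append((start, last))
            start = last = None
        elif aa in HYDROPHOBIC:
            if start is not None and i - last - 1 >= min_gap:
                # Too many non-hydrophobic residues: close the cluster.
                clusters.append((start, last))
                start = None
            if start is None:
                start = i
            last = i
    if start is not None:
        clusters.append((start, last))
    return clusters


if __name__ == "__main__":
    print(hydrophobic_clusters("AVLAAGGGGSIWP"))
```

Isolated hydrophobic residues are reported as clusters of length 1; a real pipeline might filter them out.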
Hydrophobic Clusters Properties
145 protein families: 613 sequences (< 30% pairwise identity)
% RSS overlapping HC: 89.8%
% HC overlapping RSS: 85.7%

                                     RSS     HC      THC     FHC
Mean length                          8.4     8.1     8.8     4.2
Mean % solvent accessible surface    25.3    24.4    23.7    33.2
% Identity                           20.0    20.6    21.3    27.0
% Hydrophobic position conserved     61.1    72.0    72.1    72.0
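The two overlap percentages could be computed along these lines. The slide does not define the measure, so this sketch assumes it is the share of segments of one kind that share at least one position with a segment of the other kind; the function name and the example coordinates are hypothetical.

```python
def percent_overlapping(segments, others):
    """Percentage of `segments` sharing at least one position with `others`.

    Segments are (start, end) pairs with inclusive bounds.
    """
    if not segments:
        return 0.0
    hits = sum(
        any(start <= o_end and o_start <= end for o_start, o_end in others)
        for start, end in segments
    )
    return 100.0 * hits / len(segments)


if __name__ == "__main__":
    rss = [(0, 5), (10, 15), (20, 25)]  # made-up secondary structure segments
    hc = [(4, 11)]                      # made-up hydrophobic cluster
    print(percent_overlapping(rss, hc))  # share of RSS overlapping a HC
    print(percent_overlapping(hc, rss))  # share of HC overlapping a RSS
```

The measure is not symmetric, which is why the slide reports both directions.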
Alignment of Sequences using Hydrophobic Clusters
All residues under the same evolution pressure: one substitution matrix, one set of gap penalties
Evolution pressure in structure > out of structure:
- in HC: structure-specific substitution matrix and gap penalties
- out of HC: out-of-structure-specific substitution matrix and gap penalties
[Figure: example residues W A F G A W A F G A P P L L W H W I, labelled in HC / out of HC]
Thompson et al., 1995
48 substitution matrices (24 in structure, 24 out of structure)
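The idea of scoring positions inside and outside hydrophobic clusters differently can be sketched with a global alignment using affine gaps and two scoring regimes. This is an illustration under stated assumptions, not the authors' algorithm: a substitution is scored with the in-structure function only when both residues lie in a HC, a gap is charged according to the regime of the residue facing it, and a gap of length k costs open + (k - 1) * extend. The function `hc_align_score` and the toy scoring function are hypothetical.

```python
NEG = float("-inf")


def hc_align_score(a, b, in_a, in_b, sub_in, sub_out, gaps_in, gaps_out):
    """Global alignment score with separate in-HC / out-of-HC parameters.

    a, b: sequences; in_a, in_b: booleans, True where the residue is in a HC.
    sub_in, sub_out: callables (x, y) -> substitution score.
    gaps_in, gaps_out: (open, extend) penalties given as positive numbers.
    """
    n, m = len(a), len(b)
    M = [[NEG] * (m + 1) for _ in range(n + 1)]  # a[i-1] aligned to b[j-1]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]  # a[i-1] facing a gap
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]  # b[j-1] facing a gap
    M[0][0] = 0.0
    for i in range(1, n + 1):
        go, ge = gaps_in if in_a[i - 1] else gaps_out
        X[i][0] = max(M[i - 1][0] - go, X[i - 1][0] - ge)
    for j in range(1, m + 1):
        go, ge = gaps_in if in_b[j - 1] else gaps_out
        Y[0][j] = max(M[0][j - 1] - go, Y[0][j - 1] - ge)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumption: in-structure scoring only if both residues are in a HC.
            sub = sub_in if (in_a[i - 1] and in_b[j - 1]) else sub_out
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1],
                          Y[i - 1][j - 1]) + sub(a[i - 1], b[j - 1])
            go, ge = gaps_in if in_a[i - 1] else gaps_out
            X[i][j] = max(M[i - 1][j] - go, Y[i - 1][j] - go, X[i - 1][j] - ge)
            go, ge = gaps_in if in_b[j - 1] else gaps_out
            Y[i][j] = max(M[i][j - 1] - go, X[i][j - 1] - go, Y[i][j - 1] - ge)
    return max(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    toy = lambda x, y: 1 if x == y else -1  # stand-in for a real matrix
    print(hc_align_score("AW", "W", [True, False], [False],
                         toy, toy, gaps_in=(10, 1), gaps_out=(2, 1)))
```

With a high in-HC gap opening penalty, the sketch prefers a mismatch followed by a gap outside the cluster over a gap facing a clustered residue, which is the behaviour the two-regime scheme is meant to encourage.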
Evaluation of the HC Fitting Matrices (1)
8 homologous couples of proteins with reference alignments
- 24 matrices in structure, 24 matrices out of structure; GOP, GEP (gap opening and extension penalties): 0 to 15, with cgop = GOP and cgep = GEP
- 4 matrices (HSDM, Blosum30, Blosum62, Gonnet); GOP, GEP: 0 to 15
% Correctly Aligned Pairs (% CAP)
Results with optimized parameters for each couple: alignment landscape
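A minimal sketch of the % CAP measure follows. The slide does not spell out the definition, so the sketch assumes it is the share of residue pairs aligned in the reference alignment that are also aligned in the tested alignment; the function names and the toy alignments are hypothetical.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Set of (i, j) residue-index pairs placed in the same column."""
    pairs = set()
    i = j = 0  # residue counters for each sequence
    for x, y in zip(row_a, row_b):
        if x != gap and y != gap:
            pairs.add((i, j))
        i += x != gap
        j += y != gap
    return pairs


def percent_cap(test, reference):
    """Percentage of reference-aligned pairs recovered by `test`.

    Each alignment is a (row_a, row_b) tuple of equal-length gapped strings.
    """
    ref = aligned_pairs(*reference)
    if not ref:
        return 0.0
    return 100.0 * len(ref & aligned_pairs(*test)) / len(ref)


if __name__ == "__main__":
    reference = ("AC-GT", "ACTGT")
    test = ("ACG-T", "ACTGT")
    print(percent_cap(test, reference))
```

Evaluating this measure for every (GOP, GEP) pair from 0 to 15 gives one alignment landscape per couple and per matrix.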
Evaluation of the HC fitting matrices (2)
CpG binding proteins (α/β, 24% identity)
Landscape of the % CAP according to gap penalties
[Figure: two landscapes, axes GOP and GEP; panel label: Blosum62; values shown: 66 %, 67 %]
Evaluation of the HC fitting matrices (2)
Landscape of the % CAP according to gap penalties
[Figure: landscapes for HSDM and Blosum62, axes GOP and GEP, one panel per couple: α 26%, α 13%, α 5%, β 13%, β 9%, β 11%, α/β 24%, α 16%]
Evaluation of the HC fitting matrices (3)
Parameters for best average on the 8 couples:

Matrix                          GOP    GEP    Mean % CAP
2 matrices                      14     2      55.4
HSDM (Prlic et al., 2000)       1      11     48.3
Blosum62 (Henikoff, 1992)       4      0      48.0
Gonnet (Gonnet et al., 1992)    0      2      46.0
Blosum30 (Henikoff, 1992)       2      0      41.9
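Selecting the gap penalties that give the best average over all couples can be sketched as below. This is an illustration only: it assumes one landscape per couple, stored as a dictionary from (GOP, GEP) to % CAP, and the numbers in the example are made up rather than taken from the study.

```python
from statistics import mean


def best_average(landscapes):
    """Return ((gop, gep), mean % CAP) maximising the average over couples.

    landscapes: list of dicts, one per couple, mapping (gop, gep) -> % CAP.
    Only parameter pairs present in every landscape are considered.
    """
    shared = set.intersection(*(set(l) for l in landscapes))
    average = {p: mean(l[p] for l in landscapes) for p in sorted(shared)}
    best = max(average, key=average.get)
    return best, average[best]


if __name__ == "__main__":
    # Hypothetical landscapes for two couples.
    couples = [
        {(0, 0): 10.0, (1, 2): 50.0, (14, 2): 60.0},
        {(0, 0): 30.0, (1, 2): 40.0, (14, 2): 35.0},
    ]
    print(best_average(couples))
```

The parameters that are best on average are generally not the best for any single couple, which is why the per-couple optima and the average optimum are reported separately.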
Evaluation of the HC fitting matrices (4)
Average landscape
[Figure: average landscapes for Blosum62 and HSDM, axes GOP and GEP]
Tests for Evaluation of the HC fitting approach
8 couples of proteins with reference alignments
- 1 matrix out of structure (GOP, GEP: 0 to 15) and 1 matrix in structure (cgop, cgep: 0 to 15)
- 4 matrices (HSDM, Blosum30, Blosum62, Gonnet); GOP, GEP: 0 to 15
% Correctly Aligned Pairs (% CAP)
Alignment results with optimized parameters for each couple
Evaluation of the HC fitting gap penalties
Best results with optimized parameters for the 8 couples (% CAP):

Structural class                α/β     α       α       α       α       β       β       β
% Sequence Id                   24%     26%     16%     13%     5%      13%     9%      11%     Mean
matrices                        76.3    97.4    87.8    80.0    36.3    74.0    27.5    45.2    65.8
                                +8.3%   +8.9%   =       +10.8%  +1.5%   +1.6%   =       =
HSDM (Prlic et al., 2000)       73.4    86.6    87.2    54.5    7.0     50.8    23.7    38.2    52.7
Blosum62 (Henikoff, 1992)       66.0    92.3    77.2    70.8    41.7    55.1    31.1    32.8    58.3
Gonnet (Gonnet et al., 1992)    61.3    90.3    73.3    70.8    8.7     46.9    31.5    39.5    52.8
Blosum30 (Henikoff, 1992)       76.1    81.2    69.0    48.5    16.1    48.0    21.0    30.1    48.7
Sequence Alignment Approaches Comparison
Plastocyanin / Azurin (β, 13% Id)
[Figure: alignments compared: HSDM; HCA + manual alignment (Gaboriaud et al., 1987)]
Improvement of Remote Protein Alignment
HC fitting approach: improvement of pairwise sequence alignment for distantly related proteins
- Multiple alignment: distance matrix for the phylogenetic guide tree
- Homology detection
Usually: alignment score
Evaluation of Homology and Phylogenetic Distance
CpG binding proteins
[Figure: landscapes of % CAP and of score for Blosum62 and HSDM]
Alignment score + evaluation of hydrophobic clusters superimposition: SOV (Zemla et al., 1993)
- Detect homologous proteins (< 30% identity)
- Evaluate distance between 2 sequences
Perspectives
- Target sequence
- Local database (TrEMBL, Swiss-Prot, ...)
- Pairwise alignment; score: homologous?
- Multiple alignment
- Set of homologous sequences + distances
Acknowledgements
Alessandra Carbone
Sophie Abby: SOV index development and analysis
Thomas Rolland: database development and Web application
Lab web address: http://www.ihes.fr/~carbone/index.htm