Chapter 2 Analysis of some Sequence...

2.1 Introduction

As mentioned in Chapter 1, several methods are available to classify human missense mutations into either benign or pathogenic categories. These methods use different attributes related to missense mutations: sequence-based attributes (Ng and Henikoff, 2001, 2003; Ferrer-Costa et al., 2004; Thomas et al., 2004; Capriotti et al., 2006; Tian et al., 2007; Calabrese et al., 2009), evolution-based attributes (Fay et al., 2001), physicochemical properties (Stone and Sidow, 2005), and combinations of structural and evolutionary information (Sunyaev et al., 2001; Chasman and Adams, 2001; Ferrer-Costa et al., 2002; Bromberg and Rost, 2007; Ramensky et al., 2002; Stitziel et al., 2004; Reumers et al., 2005; Cavallo and Martin, 2005; Bao et al., 2005; Ferrer-Costa et al., 2005; Yue et al., 2006; Mathe et al., 2006; Li et al., 2009; Adzhubei et al., 2010).

Recently, an SVM-based method has been developed which uses a new set of features, referred to as neutral-disease missense mutation discriminatory (NDMSMD) features. The NDMSMD features include position-specific probability scores calculated using a Dirichlet mixture as prior information, position-specific scores calculated using Gribskov's approach, predicted solvent accessibility and secondary structure, BLOSUM62 substitution scores, and the change in transfer free energy associated with the wild-type and mutant amino acid residues. Before these 10 features were used in the SVM, the distribution pattern of each feature in known disease and neutral missense mutations was studied; the details are given in this chapter.

2.2 Materials and Methods

2.2.1 The Neutral-Disease MisSense Mutation Discriminatory (NDMSMD) Features

The ten NDMSMD features comprise six position-specific features, two protein-structure-based features and two amino acid residue-based features (Table 2.1). The details of their calculation are given below.

2.2.2 The Position-Specific Features

These features correspond to position-specific preferences of amino acid residues at the missense mutation sites. One of the features is the position-specific probability score P_ab and the other is Gribskov's score G_ab (Gribskov et al., 1987). The P_ab scores (Tian et al., 2007) were calculated using two sources of information: a 20-component Dirichlet mixture (DM) as the prior, and the observed amino acid frequencies from the multiple sequence alignment of the target protein sequence and its homologues (Sjölander et al., 1996; Tian et al., 2007). The details of the calculation are given below.
Table 2.1: The list of 10 NDMSMD features analyzed in the present study.

Features                      Description
Position-specific             Position-specific probability score of the wild-type amino acid residue, P_ab(WT) @
                              Position-specific probability score of the mutant-type amino acid residue, P_ab(MT) @
                              Difference between the position-specific probability scores of the wild-type and mutant amino acid residues, i.e., diff P_ab(WT-MT)
                              Gribskov's score of the wild-type amino acid residue, G_ab(WT) *
                              Gribskov's score of the mutant-type amino acid residue, G_ab(MT) *
                              Difference between Gribskov's scores of the wild-type and mutant amino acid residues, i.e., diff G_ab(WT-MT)
Sequence & structure based    Solvent accessibility status of the amino acid at the mutation site &&: 1 if it is buried (solvent accessible surface area < 10%); 0 if it is exposed
                              Secondary structural status of the amino acid residue at the mutation site **: 1 if it is part of an alpha helix; 2 if it is part of an extended strand; 0 for other types
Amino acid residue based      Difference in transfer free energy values of the wild-type and mutated-type residues from inside to surface of the protein @@
                              BLOSUM62 substitution score for the wild-type to mutated-type amino acid pair

@  These scores were calculated using the Perl script psap.pl available at http://www.mobioinfor.cn/parepro/ (Tian et al., 2007).
*  These scores were calculated using the PROPHECY program available in the EMBOSS suite (Rice et al., 2000).
&& Solvent accessibility calculated from ACCpro 4.0 (Cheng et al., 2005).
** Secondary structure prediction calculated from SSpro v4.5 (Cheng et al., 2005).
@@ Transfer free energy values from inside to outside of a globular protein (Janin, 1979).
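The ten features of Table 2.1 amount to one fixed-length record per missense mutation. The following Python sketch shows one way such a record could be held before being passed to a classifier. All identifiers (NdmsmdFeatures, make_features, the field names) are hypothetical and do not come from the original study, the numeric values in the usage note are purely illustrative, and the "wild type minus mutant" direction of the two difference features is an assumption based on the figure axis labels.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class NdmsmdFeatures:
    """One missense mutation described by the ten features of Table 2.1.

    Field names are hypothetical; only their meaning follows the table.
    """
    pab_wt: float            # position-specific probability score, wild type
    pab_mt: float            # position-specific probability score, mutant
    pab_diff: float          # pab_wt - pab_mt (assumed direction)
    gab_wt: float            # Gribskov's score, wild type
    gab_mt: float            # Gribskov's score, mutant
    gab_diff: float          # gab_wt - gab_mt (assumed direction)
    buried: int              # 1 if buried (< 10% accessible), 0 if exposed
    sec_struct: int          # 1 = alpha helix, 2 = extended strand, 0 = other
    dg_transfer_diff: float  # difference in transfer free energy
    blosum62: int            # BLOSUM62 score for the WT -> MT substitution

    def as_vector(self):
        """Return the ten features as a list, in table order."""
        return list(astuple(self))


def make_features(pab_wt, pab_mt, gab_wt, gab_mt,
                  buried, sec_struct, dg_transfer_diff, blosum62):
    """Build a record, deriving the two difference features."""
    return NdmsmdFeatures(
        pab_wt, pab_mt, pab_wt - pab_mt,
        gab_wt, gab_mt, gab_wt - gab_mt,
        buried, sec_struct, dg_transfer_diff, blosum62,
    )
```

For example, make_features(0.30, 0.01, 2.0, -1.0, 1, 2, -0.9, -3).as_vector() returns a 10-element list in the order of Table 2.1. Deriving the two difference fields inside the constructor helper keeps them consistent with the wild-type and mutant scores they come from.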
P_ab is an estimate of the position-specific probability of amino acid a at the missense mutation site b of a given human protein sequence (query) and is calculated as follows:

$$P_{ab} = \sum_{m=1}^{t} \mathrm{Prob}(m \mid \vec{n}_b)\,\frac{n_{ab} + \alpha_{m,a}}{|\vec{n}_b| + |\vec{\alpha}_m|} \tag{2.1}$$

where

$$\mathrm{Prob}(m \mid \vec{n}_b) = \frac{q_m\, B(\vec{\alpha}_m + \vec{n}_b) / B(\vec{\alpha}_m)}{\sum_{k=1}^{t} q_k\, B(\vec{\alpha}_k + \vec{n}_b) / B(\vec{\alpha}_k)} \tag{2.2}$$

where q_m is the mixture coefficient of component m of the Dirichlet mixture of priors, alpha_m is the corresponding alpha parameter, t is the total number of components, which is 20, B is the beta function, and n_b is the amino acid count vector observed in the multiple sequence alignment of the human sequence and its homologues.

The amino acid counts in Equation 2.2 are calculated by multiplying the total number of sequences S by

(2.3)

where the weight term is the sum of Henikoff's position-specific weights (Henikoff and Henikoff, 1996) calculated for amino acid a at a particular position b, without gaps, in the multiple sequence alignment of the query and its homologues.

(2.4)
where the background term is the frequency of any amino acid a observed in nature (Jones et al., 1992), and the weighted column is obtained from the multiple sequence alignment of the query and its homologues with no gaps and is calculated as given below; its gapped counterpart is calculated as in Equation 2.5 except that the column contains gaps.

(2.5)

where N is the total number of aligned sequences and

(2.6)

where the three quantities are, respectively, the sequence weight at position b; the number of different amino acid residue types that are present at position b in the alignment; and the number of times that a particular residue from the query protein appears in the alignment column.

Gribskov's score G_ab (Gribskov et al., 1987) was calculated using the PROPHECY program available in the EMBOSS suite (Rice et al., 2000), from the same multiple sequence alignments that were used for calculating P_ab. G_ab is defined as a weighted average of the similarity scores for an amino acid a at position b,

$$G_{ab} = \sum_{aa} W(aa, b)\, SM(a, aa) \tag{2.7}$$

where SM(a, aa) is the similarity matrix (BLOSUM62) score for amino acid a replacing another amino acid aa, and

$$W(aa, b) = \frac{n(aa, b)}{N_b} \tag{2.8}$$

that is, W(aa, b) is the number of amino acid residues aa at position b, n(aa, b), divided by the total number of amino acid residues present at position b, N_b, including gaps present in the alignment column.

The multiple sequence alignments (MSAs) of each human protein with its homologues were calculated using CLUSTALW2 (Chenna et al., 2003). The homologues of the human protein were identified using PSI-BLAST searches against the non-redundant (nr) database with an E-value of 1e-15 (Mooney and Klein, 2002), with three to four rounds of iteration until convergence was reached.

While searching for homologues, only the relevant domain containing a given mutation was given as the query. Domain boundaries were identified using ProDom (http://profom.prabi.fr/profom/current/html/home.php; Bru et al., 2005). Human proteins were removed from the PSI-BLAST hits. Furthermore, homologues shorter than 70% of the query sequence length were also removed before the MSAs were built with CLUSTALW2.
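The two position-specific scores of this section can be illustrated with a short Python sketch. It is not the psap.pl or PROPHECY implementation used in the study. The first function assumes the posterior-mean form of the Dirichlet-mixture estimate given by Sjölander et al. (1996); the second follows the verbal definition of Gribskov's score as a frequency-weighted average of similarity scores, with gaps counted in the column total. The function names, the 3-letter toy alphabet, the 2-component toy mixture and the tiny BLOSUM62 excerpt are all assumptions made for illustration (the study uses 20 amino acids and a 20-component mixture).

```python
import math


def log_beta(x):
    """Log of the multivariate beta function B(x) = prod(Gamma(x_i)) / Gamma(sum(x))."""
    return sum(math.lgamma(v) for v in x) - math.lgamma(sum(x))


def dirichlet_mixture_probs(counts, q, alphas):
    """Posterior-mean residue probabilities for one alignment column.

    counts: observed (possibly weighted) counts, one per residue type
    q:      mixture coefficients, one per component
    alphas: one alpha vector per component, same length as counts
    """
    # Posterior weight of each component, in log space for numerical stability.
    log_w = [
        math.log(q_m)
        + log_beta([n + a for n, a in zip(counts, alpha_m)])
        - log_beta(alpha_m)
        for q_m, alpha_m in zip(q, alphas)
    ]
    shift = max(log_w)
    w = [math.exp(v - shift) for v in log_w]
    total = sum(w)
    w = [v / total for v in w]

    # Weighted average of the per-component pseudocount estimates.
    n_total = sum(counts)
    return [
        sum(w_m * (counts[i] + alpha_m[i]) / (n_total + sum(alpha_m))
            for w_m, alpha_m in zip(w, alphas))
        for i in range(len(counts))
    ]


def gribskov_score(residue, column, similarity):
    """Frequency-weighted average similarity of `residue` to one column.

    column:     aligned characters at one position ('-' marks a gap)
    similarity: dict mapping (a, aa) pairs to substitution scores
    Gaps count towards the column total but contribute no score.
    """
    total = len(column)
    score = 0.0
    for aa in sorted(set(column) - {"-"}):
        score += (column.count(aa) / total) * similarity[(residue, aa)]
    return score


# Toy inputs (illustrative values only).
toy_probs = dirichlet_mixture_probs(
    [3.0, 1.0, 0.0], [0.6, 0.4], [[1.0, 1.0, 1.0], [5.0, 0.5, 0.5]]
)

# A few BLOSUM62 entries, stored in both orientations.
blosum_subset = {("A", "A"): 4, ("A", "S"): 1, ("A", "G"): 0,
                 ("S", "S"): 4, ("S", "G"): 0, ("G", "G"): 6}
blosum_subset.update({(b, a): s for (a, b), s in list(blosum_subset.items())})
g_ala = gribskov_score("A", "AASG-", blosum_subset)
```

In the toy column "AASG-", alanine makes up two of the five positions, so its score is 0.4 x 4 + 0.2 x 1 + 0.2 x 0 = 1.8. The Dirichlet-mixture probabilities always sum to one, and with a single component the estimate reduces to simple pseudocount smoothing, (n_a + alpha_a) / (|n| + |alpha|).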
2.2.3 Structure-based Features

Two structure-based features were considered: the solvent accessibility status and the secondary structural status. ACCpro (Cheng et al., 2005) was used to predict the solvent accessibility values of amino acid residues. A predicted solvent accessibility value < 10% was taken as indicative of a buried residue, whereas a value > 10% was taken as indicative of an exposed residue. SSpro v4.5 of the SCRATCH suite (Cheng et al., 2005) was used to predict the secondary structure status of the amino acid residues.

2.2.4 Free energy and pairwise substitution scores

The difference in transfer free energy values (Table 2.2) from inside to surface of the protein (Janin, 1979) between the wild-type and mutant amino acids was used as one of the two amino acid features. The other feature was the pairwise substitution score from the BLOSUM62 substitution matrix (Henikoff and Henikoff, 1992) (Figure 2.1) for the wild-type and mutant residues.

2.3 The dataset of known neutral and disease missense mutations

As mentioned in Chapter 1, various databases are available for disease/neutral variations. For the purposes of the present analysis, and also for the training and testing of the SVM as reported in Chapter 3, a curated dataset known as the HumVar dataset (Capriotti et al., 2006; Tian et al., 2007; Adzhubei et al., 2010) was used. This dataset has been derived from the Swiss-Prot/UniProt database (Capriotti et al., 2006) and comprises two categories of missense mutations, viz., "Disease" and "Polymorphism". At the time of the study, the dataset comprised 13,032 "Disease" mutations from 1111 proteins and 8,946 "Polymorphism" mutations from 3484 proteins. The results reported in the present study, as well as in the other chapters, pertain to this dataset. Recently, the dataset has been revised and now comprises 12,598 "Disease" mutations from 1101 proteins and 8,638 "Polymorphism" mutations from 3399 proteins. All the missense mutations designated as "Polymorphism" were considered as "Neutral" in the present study. The main reason for using this dataset is that it has been used as a benchmark set by other groups (Capriotti et al., 2006; Tian et al., 2007; Adzhubei et al., 2010) while developing their methods for the prediction of pathogenic effects of missense mutations. The HumVar dataset is available at two websites: (a) ftp://genetics.bwh.harvard.edu/datasets/pph2/humvar2.0.17.tar.gz (Adzhubei et al., 2010) and (b) http://gpcr.biocomp.unibo.it/~emidio/PhDSNP/HumVar.txt (Capriotti et al., 2006). The one used in the present thesis was downloaded from site (a).
Table 2.2: Transfer free energy values of the 20 amino acids from inside to outside of a protein (taken from Janin, 1979).

Amino acid       Free energy of transfer from inside to outside of a protein
Alanine           0.3
Arginine         -1.4
Asparagine       -0.5
Aspartate        -0.6
Cysteine          0.9
Glutamine        -0.7
Glutamate        -0.7
Glycine           0.3
Histidine        -0.1
Isoleucine        0.7
Leucine           0.5
Lysine           -1.8
Methionine        0.4
Phenylalanine     0.5
Proline          -0.3
Serine           -0.1
Threonine        -0.2
Tryptophan        0.3
Tyrosine         -0.4
Valine            0.6
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *
A    4  -1  -2  -2   0  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -3  -2   0  -2  -1   0  -4
R   -1   5   0  -2  -3   1   0  -2   0  -3  -2   2  -1  -3  -2  -1  -1  -3  -2  -3  -1   0  -1  -4
N   -2   0   6   1  -3   0   0   0   1  -3  -3   0  -2  -3  -2   1   0  -4  -2  -3   3   0  -1  -4
D   -2  -2   1   6  -3   0   2  -1  -1  -3  -4  -1  -3  -3  -1   0  -1  -4  -3  -3   4   1  -1  -4
C    0  -3  -3  -3   9  -3  -4  -3  -3  -1  -1  -3  -1  -2  -3  -1  -1  -2  -2  -1  -3  -3  -2  -4
Q   -1   1   0   0  -3   5   2  -2   0  -3  -2   1   0  -3  -1   0  -1  -2  -1  -2   0   3  -1  -4
E   -1   0   0   2  -4   2   5  -2   0  -3  -3   1  -2  -3  -1   0  -1  -3  -2  -2   1   4  -1  -4
G    0  -2   0  -1  -3  -2  -2   6  -2  -4  -4  -2  -3  -3  -2   0  -2  -2  -3  -3  -1  -2  -1  -4
H   -2   0   1  -1  -3   0   0  -2   8  -3  -3  -1  -2  -1  -2  -1  -2  -2   2  -3   0   0  -1  -4
I   -1  -3  -3  -3  -1  -3  -3  -4  -3   4   2  -3   1   0  -3  -2  -1  -3  -1   3  -3  -3  -1  -4
L   -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4  -2   2   0  -3  -2  -1  -2  -1   1  -4  -3  -1  -4
K   -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5  -1  -3  -1   0  -1  -3  -2  -2   0   1  -1  -4
M   -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5   0  -2  -1  -1  -1  -1   1  -3  -1  -1  -4
F   -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6  -4  -2  -2   1   3  -1  -3  -3  -1  -4
P   -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7  -1  -1  -4  -3  -2  -2  -1  -2  -4
S    1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4   1  -3  -2  -2   0   0   0  -4
T    0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5  -2  -2   0  -1  -1   0  -4
W   -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11   2  -3  -4  -3  -2  -4
Y   -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7  -1  -3  -2  -1  -4
V    0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4  -3  -2  -1  -4
B   -2  -1   3   4  -3   0   1  -1   0  -3  -4   0  -3  -3  -2   0  -1  -4  -3  -3   4   1  -1  -4
Z   -1   0   0   1  -3   3   4  -2   0  -3  -3   1  -1  -3  -1   0  -1  -3  -2  -2   1   4  -1  -4
X    0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1  -1  -1  -4
*   -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4   1

Figure 2.1: BLOSUM62 amino acid substitution scores (Henikoff and Henikoff, 1992).
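The structure-based features and the transfer free energy feature described in Sections 2.2.3 and 2.2.4 are simple encodings, sketched below in Python. The function names are hypothetical. The dictionary holds the signed values of the Janin (1979) scale keyed by one-letter amino acid code. Three points are assumptions rather than statements from the text: the difference is taken as wild type minus mutant, a residue with exactly 10% accessibility is treated as exposed (the text only defines < 10% and > 10%), and the secondary structure labels "H" and "E" stand for helix and extended strand in the predictor output.

```python
# Janin (1979) transfer free energies, inside -> outside of a protein.
JANIN_1979 = {
    "A": 0.3, "R": -1.4, "N": -0.5, "D": -0.6, "C": 0.9,
    "Q": -0.7, "E": -0.7, "G": 0.3, "H": -0.1, "I": 0.7,
    "L": 0.5, "K": -1.8, "M": 0.4, "F": 0.5, "P": -0.3,
    "S": -0.1, "T": -0.2, "W": 0.3, "Y": -0.4, "V": 0.6,
}


def buried_flag(percent_accessible):
    """Return 1 for a buried residue (< 10% accessible surface area), else 0.

    Assumption: a value of exactly 10% is treated as exposed.
    """
    return 1 if percent_accessible < 10.0 else 0


def secondary_structure_code(state):
    """Map a 3-state prediction to 1 (helix), 2 (extended strand) or 0 (other).

    Assumption: 'H' denotes helix and 'E' extended strand.
    """
    return {"H": 1, "E": 2}.get(state, 0)


def transfer_energy_diff(wild_type, mutant):
    """Difference in transfer free energy, wild type minus mutant (assumed order)."""
    return JANIN_1979[wild_type] - JANIN_1979[mutant]
```

As a usage example, an isoleucine-to-threonine substitution gives transfer_energy_diff("I", "T") = 0.7 - (-0.2) = 0.9, and a residue predicted to have 5% accessible surface area is encoded as buried_flag(5.0) = 1.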
2.4 Results and Discussion

2.4.1 Distribution of position-specific features

2.4.1.1 P_ab Scores

It should be noted that the scores P_ab and G_ab for a given amino acid residue a at position b essentially indicate whether or not that residue is preferred at that position in a protein. Preference is indicated by scores of P_ab > 0.05 and G_ab > 0.0, and non-preference by scores of P_ab < 0.05 and G_ab < 0.0. All wild-type (WT) amino acid residues, as well as neutral mutations, should therefore show scores corresponding to preference, whereas disease mutations should show scores corresponding to non-preference.

The distribution of P_ab scores for disease mutations as well as neutral mutations from the HumVar dataset was investigated (Figure 2.2(a)). Among the disease mutations, about 91% showed P_ab scores < 0.05, indicating the excellent utility of these position-specific scores for predicting disease mutations. However, in the case of neutral mutations, only 43% showed scores > 0.05. This indicates that not all neutral mutations are preferred substitutions. To make sure that P_ab scores indicate preferred amino acid residues correctly, I investigated the distribution of P_ab scores for the wild-type (WT) amino acid residues from the HumVar dataset (Figure 2.2(b)). More than 95% of the WTs show scores > 0.05. The fact that more than half (57%) of the neutral mutations of the HumVar dataset show scores < 0.05 (indicative of non-preference) suggests that these may be mutations that occur infrequently even though their phenotypic effect is benign. It could also be due to annotation mistakes in the Swiss-Prot database from which the HumVar dataset was curated.

The distribution of the differences between the P_ab scores of the wild type (WT) and the mutant type (MT) (Figure 2.2(c)) was also examined; it showed two distinct (but overlapping) distributions for disease and benign mutations.

2.4.1.2 G_ab Scores

A G_ab score less than 0 indicates that residue a at position b is not preferred and hence is indicative of a pathogenic mutation. On the other hand, a score > 0.0 indicates preference and is indicative of a neutral mutation. From Figure 2.3(a), it can be seen that 77% of the disease mutations correspond to scores < 0.0 and about 50% of neutral mutations have scores > 0.0. Although this result indicates that the G_ab score is less discriminating than the P_ab score, it is still a good discriminator. For comparison purposes, the distribution of G_ab scores of the wild-type amino acid residues was examined (Figure 2.3(b)). About 90% of them showed values > 0, further indicating the utility of the score. The differences between the G_ab scores of the wild type and the mutant type (Figure 2.3(c)) show distinct but overlapping distributions for disease and benign mutations.
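To put the reported percentages in perspective, each score can be read as a one-threshold rule (mutant score below the cutoff implies disease). The Python sketch below combines the percentages quoted above with the dataset sizes given in Section 2.3. It is back-of-the-envelope arithmetic under the assumption that the rounded percentages apply to the full counts of 13,032 disease and 8,946 neutral mutations; it is not a result reported in the study, and the function name is hypothetical.

```python
def threshold_summary(frac_disease_flagged, frac_neutral_passed,
                      n_disease, n_neutral):
    """Sensitivity, specificity and overall accuracy of a one-threshold rule.

    frac_disease_flagged: fraction of disease mutations below the cutoff
    frac_neutral_passed:  fraction of neutral mutations above the cutoff
    """
    true_pos = frac_disease_flagged * n_disease
    true_neg = frac_neutral_passed * n_neutral
    return {
        "sensitivity": frac_disease_flagged,
        "specificity": frac_neutral_passed,
        "accuracy": (true_pos + true_neg) / (n_disease + n_neutral),
    }


# Dataset sizes from Section 2.3 (HumVar at the time of the study).
N_DISEASE, N_NEUTRAL = 13032, 8946

# P_ab(MT) < 0.05 -> disease: 91% of disease flagged, 43% of neutral passed.
pab_rule = threshold_summary(0.91, 0.43, N_DISEASE, N_NEUTRAL)
# G_ab(MT) < 0.0 -> disease: 77% of disease flagged, about 50% of neutral passed.
gab_rule = threshold_summary(0.77, 0.50, N_DISEASE, N_NEUTRAL)
```

Under these assumptions the P_ab rule classifies roughly 71% of all mutations correctly and the G_ab rule roughly 66%. In both cases the high sensitivity comes with a specificity of about one half or less, which is consistent with the overlapping distributions described above.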
Figure 2.2(a): The distribution of the P_ab score of the mutated amino acid (MT) for disease and benign mutations in the HumVar dataset. (Bar chart; y-axis: percentage; x-axis: MT < 0.05, MT > 0.05. Disease: 91% and 9%; Neutral: 57% and 43%.)

Figure 2.2(b): The distribution of the P_ab score of the wild-type amino acid (WT) for the HumVar dataset. (Bar chart; y-axis: percentage; x-axis: WT < 0.05, WT > 0.05. Values: 3.2% and 96.8%.)

Figure 2.2(c): Distribution of the difference in P_ab(WT) and P_ab(MT) scores for missense mutations in the HumVar dataset. N = neutral and D = disease mutations. (y-axis: percentage; x-axis: diff P_ab (WT - MT).)

Figure 2.3(a): Distribution of G_ab scores for Disease (D) and Neutral (N) mutations. (Bar chart; y-axis: percentage; x-axis: MT < 0, MT > 0. Disease: 77% and 23%; Neutral: 47% and 50%.)

Figure 2.3(b): Distribution of G_ab scores calculated for the wild-type (WT) amino acid residues in the HumVar dataset. (Bar chart; y-axis: percentage; x-axis: WT < 0, WT > 0. Values: 10% and 90%.)

Figure 2.3(c): Distribution of the difference in G_ab(WT) and G_ab(MT) scores for the mutations in the HumVar dataset (N = neutral and D = disease mutations). (y-axis: percentage; x-axis: diff G_ab (WT - MT), from -9 to 9.)
2.4.2 Structure and Amino acid based features

2.4.2.1 Solvent accessibility status of the missense mutations

As already mentioned, the solvent accessibility values at the missense mutation sites were predicted using the ACCpro program (Cheng et al., 2005), available as a part of the SCRATCH package. If the predicted % accessible surface area was less than 10, the mutation was considered buried; if it was more than 10, the mutation was considered exposed.

Figure 2.4 shows the distribution of the HumVar mutations into buried and exposed categories. About 77% of the neutral mutations are predicted to be exposed to solvent, indicating that mutations of exposed residues tend to have a neutral effect. Among disease mutations, about 52% are predicted to be buried and the remaining 48% are predicted to be exposed to solvent. One would have expected a majority of disease-causing mutations to be buried, as changes at buried positions would destabilize the protein structure, leading to pathogenic effects (Wang and Moult, 2001; Sunyaev et al., 2000; Gong and Blundell, 2010). However, the current analysis indicates that mutations at exposed positions can also lead to pathogenic effects (Kowarsch et al., 2010). This result is not entirely surprising, given that some surface positions could form parts of binding surfaces involved in intermolecular interactions, and hence mutations at such functionally important sites would be expected to yield pathogenic effects.
Figure 2.4: The distribution of the status of mutations as buried or exposed for the mutations in the HumVar dataset. (Bar chart; y-axis: percentage; x-axis: status of solvent accessibility, Buried and Exposed; series: Neutral, Disease.)
2.4.2.2 Secondary structural status of the missense mutations

The distribution of disease and neutral mutations across the three secondary structure states, viz., helices, strands and coil regions, is shown in Figure 2.5. As can be seen, disease as well as neutral mutations show some localization tendency. For example, disease mutations are predicted to be localized more on helices and beta strands than neutral mutations. Comparatively, coil regions harbour more neutral mutations than disease mutations.

2.4.2.3 Amino acid based features

The difference in transfer free energy values between the wild-type and the mutant residue was calculated for each entry in the HumVar dataset. As mentioned already, the transfer free energy values of the 20 amino acid residues (Table 2.2) were taken from Janin (1979). Although both disease and neutral mutations are seen in every interval from -2.0 to +2.0 (Figure 2.6), their proportions vary along the entire spectrum. The proportion of disease mutations increases at every interval for differences greater than 0.2, indicating energetically unfavoured situations in those cases.

The distributions of BLOSUM62 substitution scores for both disease and neutral mutations (Figure 2.7) were also examined. A score > +1 indicates a preferred substitution and a score < 1 a non-preferred one. Of the disease mutations, 65% correspond to scores < 1, whereas 60% of neutral mutations correspond to scores > +1. This result confirms the observations made in earlier reports (Cargill et al., 1999; Balasubramanian et al., 2005).

Figure 2.5: The distribution of secondary structure status for the HumVar dataset. (Bar chart; y-axis: percentage; x-axis: status of secondary structure, Helix, Strand and Rest; series: Neutral, Disease.)

Figure 2.6: The distribution of the difference in transfer free energy, from inside to outside of the protein, between wild-type and mutant residues. (y-axis: percentage; x-axis: diff free energy (WT - MT), from -2 to 2; series: Neutral, Disease.)

Figure 2.7: The BLOSUM62 substitution score distribution for disease and neutral mutations. (Bar chart; y-axis: percentage; x-axis: BLOSUM62 score, < 1 and > 1; series: Neutral, Disease.)

2.5 Summary

In this chapter, I have investigated some sequence- and structure-based features of disease and benign mutations that are to be used in the SVM-based method. I have used a new set of 10 features, referred to as neutral-disease missense mutation discriminatory (NDMSMD) features. The ten NDMSMD features include six position-specific features, two protein-structure-based features and two amino acid residue-based features. This study also revealed the discriminatory power of each of the ten individual features for the prediction of pathogenic mutations. The next chapter deals with the use of the 10 NDMSMD features in the newly developed SVM-based method for the prediction of pathogenic mutations.