Chapter 2 Analysis of some Sequence...


2.1 Introduction

As mentioned in Chapter 1, several methods are available to classify human missense mutations into either benign or pathogenic categories, and these methods use different attributes related to missense mutations, such as sequence-based features (Ng and Henikoff, 2001, 2003; Ferrer-Costa et al., 2004; Thomas et al., 2004; Capriotti et al., 2006; Tian et al., 2007; Calabrese et al., 2009), evolution-based features (Fay et al., 2001), physicochemical properties (Stone and Sidow, 2005), and a combination of structural and evolutionary information (Sunyaev et al., 2001; Chasman and Adams, 2001; Ferrer-Costa et al., 2002; Bromberg and Rost, 2007; Ramensky et al., 2002; Stitziel et al., 2004; Reumers et al., 2005; Cavallo and Martin, 2005; Bao et al., 2005; Ferrer-Costa et al., 2005; Yue et al., 2006; Mathe et al., 2006; Li et al., 2009; Adzhubei et al., 2010).

Recently, an SVM-based method has been developed which uses a new set of features, referred to as neutral-disease missense mutation discriminatory (NDMSMD) features. The NDMSMD features include position-specific probability scores calculated using a Dirichlet mixture of prior information as well as Gribskov's approach, predicted solvent accessibility and secondary structural features, BLOSUM62 substitution scores, and the change in free energy associated with the wild-type and mutant amino acid residues. Before utilizing these 10 features in the SVM, the distribution patterns of each of the features in known disease and neutral missense mutations were studied, and the details are given in this chapter.

2.2 Materials and Methods

2.2.1 The Neutral Disease MisSense Mutation Discriminatory (NDMSMD) Features

The ten NDMSMD features include six position-specific features, two protein-structure-based features and two amino-acid-residue-based features (Table 2.1). The details of the calculation of these features are given below.

2.2.2 The Position Specific Features

These features correspond to the position-specific preferences of amino acid residues at the missense mutation sites. One of the features is the position-specific probability score (P_ab) and the other is Gribskov's score (G_ab) (Gribskov et al., 1987). The P_ab scores (Tian et al., 2007) were calculated using a Dirichlet mixture as the prior information, available in the form of 20-component Dirichlet mixtures (DMs), together with the observed amino acid frequencies from the multiple sequence alignment of the target protein sequence and its homologues (Sjölander et al., 1996; Tian et al., 2007). The details of the calculation are given below.

Table 2.1: The list of 10 NDMSMD features analyzed in the present study.

Features                      Description
Position specific             Position-specific probability score of the wild-type amino acid residue (P_ab) @
                              Position-specific probability score of the mutant-type amino acid residue (P_ab) @
                              Difference between the position-specific probability scores of the wild-type and the mutant amino acid residues, i.e., diff(P_ab)
                              Gribskov's score of the wild-type amino acid residue (G_ab) *
                              Gribskov's score of the mutant-type amino acid residue (G_ab) *
                              Difference between the Gribskov's scores of the wild-type and the mutant amino acid residues, i.e., diff(G_ab)
Sequence & structure based    Solvent accessibility status of the amino acid at the mutation site &&: 1 if it is buried (solvent-accessible surface area < 10%); 0 if it is exposed
                              Secondary structural status of the amino acid residue at the mutation site **: 1 if it is a part of an alpha helix; 2 if it is a part of an extended strand; 0 for other types
Amino acid residue based      Difference in transfer free energy values of the wild type and the mutated type from inside to surface of the protein @@
                              BLOSUM62 substitution score for the wild-type and mutated-type amino acid pair

@ These scores were calculated using the perl script psap.pl available at http://www.mobioinfor.cn/parepro/ (Tian et al., 2007).
* These scores were calculated using the PROPHECY program available in the EMBOSS suite (Rice et al., 2000).
&& Solvent accessibility calculated from ACCpro 4.0 (Cheng et al., 2005).
** Secondary structure prediction calculated from SSpro v4.5 (Cheng et al., 2005).
@@ Transfer free energy values from inside to outside of a globular protein (Janin, 1979).
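
The ten features of Table 2.1 can be pictured as one fixed-length numeric record per mutation. The following is a minimal sketch under stated assumptions, not the code used in this study: the class name `NdmsmdFeatures`, its field names and the two helper functions are hypothetical, the one-letter codes `H` and `E` are an assumed predictor output, and a value of exactly 10% is treated as exposed because the table only specifies < 10% as buried.

```python
from dataclasses import dataclass


def encode_buried(accessible_percent: float) -> int:
    """Table 2.1 encoding: 1 = buried (< 10% accessible), 0 = exposed."""
    # Assumption: exactly 10% counts as exposed.
    return 1 if accessible_percent < 10.0 else 0


def encode_secondary_structure(state: str) -> int:
    """Table 2.1 encoding: 1 = alpha helix, 2 = extended strand, 0 = other."""
    # Assumption: the predictor reports 'H' for helix and 'E' for strand.
    return {"H": 1, "E": 2}.get(state, 0)


@dataclass
class NdmsmdFeatures:
    """Hypothetical container for the features of one missense mutation."""
    p_wt: float              # position-specific probability score, wild type
    p_mt: float              # position-specific probability score, mutant
    g_wt: float              # Gribskov's score, wild type
    g_mt: float              # Gribskov's score, mutant
    buried: int              # output of encode_buried
    sec_struct: int          # output of encode_secondary_structure
    diff_free_energy: float  # transfer free energy, wild type minus mutant
    blosum62: int            # BLOSUM62 score for the wild type/mutant pair

    def as_vector(self) -> list:
        """Return the ten features in the order in which Table 2.1 lists them."""
        return [
            self.p_wt, self.p_mt, self.p_wt - self.p_mt,
            self.g_wt, self.g_mt, self.g_wt - self.g_mt,
            self.buried, self.sec_struct,
            self.diff_free_energy, self.blosum62,
        ]
```

The two difference features are derived inside `as_vector`, so they cannot drift out of step with the wild-type and mutant scores they are computed from.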

P_ab is an estimation of the position-specific probability of amino acid a at the missense mutation site b of a given human protein sequence (query) and is calculated as follows:

[Equation 2.1]

where

[Equation 2.2]

The terms in Equations 2.1 and 2.2 are the mixture coefficient of component m from a Dirichlet mixture of priors, the corresponding alpha parameter, the total number of components t (which is 20), the beta function, and the amino acid count vector that is observed in the multiple sequence alignment of the human sequence and its homologues. One of the terms in Equation 2.2 is calculated by multiplying the total number of sequences S by the quantity given in Equation 2.3:

[Equation 2.3]

where one term is the sum of Henikoff's position-specific weights (Henikoff and Henikoff, 1996) calculated for amino acid a at a particular position b, without gaps, in the multiple sequence alignment of the query and its homologues:

[Equation 2.4]

The terms in Equation 2.4 are the frequency of any amino acid a observed in nature (Jones et al., 1992) and the weighted column obtained from the multiple sequence alignment of the query and its homologues with no gaps, which is calculated as given in Equation 2.5; a further term is calculated as in Equation 2.5 except that it contains gaps.

[Equation 2.5]

where N is the total number of aligned sequences, and

[Equation 2.6]

The terms in Equation 2.6 are the sequence weight at position b, the number of different amino acid residue types that are present at position b in the alignment, and the number of times that a particular residue from the query protein appears in the alignment column.

Gribskov's score G_ab (Gribskov et al., 1987) was calculated using the PROPHECY program available in the EMBOSS suite (Rice et al., 2000), using the multiple sequence alignments used for calculating P_ab. G_ab is defined as a weighted average of the similarity scores for an amino acid a at the position b,

[Equation 2.7]

where SM(a, aa) is the similarity matrix (BLOSUM62) score for amino acid a replacing another amino acid aa, and

[Equation 2.8]

where the weight is defined as the number of amino acid residues aa divided by the total number of amino acid residues present at position b, including gaps present in the alignment column.

The multiple sequence alignments (MSAs) of the human protein with its homologues were calculated using CLUSTALW2 (Chenna et al., 2003). The human protein homologs were identified using PSI-BLAST searches against the non-redundant (nr) database with an E-value of 1e-15 (Mooney and Klein, 2002), with three to four rounds of iteration until convergence was reached.

While searching for homologs, only the relevant domain containing a given mutation was given as the query. Domain boundaries were identified using ProDom (http://profom.prabi.fr/profom/current/html/home.php; Bru et al., 2005). From the PSI-BLAST hits, human proteins were removed. Furthermore, homologues shorter than 70% of the query sequence length were also removed before doing the MSAs with CLUSTALW2.
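
The verbal definition of Gribskov's score given above (a weighted average of similarity scores, where each weight is the fraction of a residue type in the alignment column and gaps count towards the column total) can be illustrated with a short sketch. This is not the PROPHECY implementation: the function names `similarity` and `gribskov_score`, the toy alignment column and the choice to let gap characters add nothing to the sum are assumptions made for illustration. Only the few BLOSUM62 entries needed for the toy column are included.

```python
from collections import Counter

# A handful of BLOSUM62 entries, stored once per unordered residue pair.
SIMILARITY = {
    ("L", "L"): 4, ("I", "L"): 2, ("L", "V"): 1,
    ("D", "L"): -4, ("D", "I"): -3, ("D", "V"): -3, ("D", "D"): 6,
}


def similarity(a: str, aa: str) -> int:
    """Symmetric lookup of the similarity score SM(a, aa)."""
    return SIMILARITY[tuple(sorted((a, aa)))]


def gribskov_score(a: str, column: str) -> float:
    """Weighted average of SM(a, aa) over the residues aa in one column.

    The weight of aa is its count divided by the column length, so gaps
    ('-') enlarge the denominator but contribute no similarity term.
    """
    total = len(column)
    counts = Counter(res for res in column if res != "-")
    return sum((n / total) * similarity(a, aa) for aa, n in counts.items())


# Toy column of five aligned sequences, one of which has a gap here.
toy_column = "LLIV-"
wild_type_score = gribskov_score("L", toy_column)
mutant_score = gribskov_score("D", toy_column)
```

For this toy column the wild-type leucine scores about 2.2 and a mutant aspartate about -2.8, so the sign of the score separates the residue that fits the column from the one that does not.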

2.2.3 Structure based Features

Two structure-based features were considered: the solvent accessibility status and the secondary structural status. ACCpro (Cheng et al., 2005) was used to predict the solvent accessibility values of amino acid residues. A predicted solvent accessibility value < 10% was taken as indicative of a buried residue, whereas a value > 10% was considered to indicate an exposed residue. SSpro v4.5 of the SCRATCH suite (Cheng et al., 2005) was used to predict the secondary structure status of the amino acid residues.

2.2.4 Free energy and pairwise substitution scores

The difference in transfer free energy values (Table 2.2) from inside to surface of the protein (Janin, 1979) between the wild-type and mutant amino acids was used as one of the two amino acid features. The other feature was the pairwise substitution score from the BLOSUM62 substitution matrix (Henikoff and Henikoff, 1992) (Figure 2.1) for the wild-type and mutant residues.

2.3 The dataset of known neutral and disease missense mutations

As mentioned in Chapter 1, various databases are available for disease/neutral variations. For the purposes of the present analysis, and also for the training and testing of the SVM as reported in Chapter 3, a curated dataset known as the HumVar dataset (Capriotti et al., 2006; Tian et al., 2007; Adzhubei et al., 2010) was used. This dataset has been derived from the Swiss-Prot/UniProt database (Capriotti et al., 2006) and comprises two categories of missense mutations, viz., Disease and Polymorphism. At the time of the study, this dataset comprised 13,032 Disease mutations from 1111 proteins and 8,946 Polymorphism mutations from 3484 proteins. The results reported in the present study as well as in the other chapters pertain to this dataset. Recently, the dataset has been revised and comprises 12,598 Disease mutations from 1101 proteins and 8,638 Polymorphism mutations from 3399 proteins. All the missense mutations designated as Polymorphism were considered Neutral in the present study. The main reason for using this dataset is that it has been used as a benchmark set by other groups (Capriotti et al., 2006; Tian et al., 2007; Adzhubei et al., 2010) while developing their methods for the prediction of pathogenic effects of missense mutations. The HumVar dataset is available at two websites: (a) ftp://genetics.bwh.harvard.edu/datasets/pph2/humvar2.0.17.tar.gz (Adzhubei et al., 2010) and (b) http://gpcr.biocomp.unibo.it/~emidio/PhDSNP/HumVar.txt (Capriotti et al., 2006). The one used in the present thesis was downloaded from site (a).

Table 2.2: Transfer free energy values of 20 amino acids from inside to outside of a protein (taken from Janin, 1979).

Amino acid       Free energy of transfer from inside to outside of a protein
Alanine           0.3
Arginine         -1.4
Asparagine       -0.5
Aspartate        -0.6
Cysteine          0.9
Glutamine        -0.7
Glutamate        -0.7
Glycine           0.3
Histidine        -0.1
Isoleucine        0.7
Leucine           0.5
Lysine           -1.8
Methionine        0.4
Phenylalanine     0.5
Proline          -0.3
Serine           -0.1
Threonine        -0.2
Tryptophan        0.3
Tyrosine         -0.4
Valine            0.6
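
The free-energy feature is a simple difference of two entries of Table 2.2. The sketch below shows the arithmetic under stated assumptions: the dictionary name `TRANSFER_FREE_ENERGY` and the function `diff_free_energy` are hypothetical, the signs follow the Janin (1979) scale as it is commonly tabulated, and the direction of the difference (wild type minus mutant) follows the axis label of Figure 2.6.

```python
# Transfer free energy values of Table 2.2 (Janin, 1979), by one-letter code.
# Assumption: signs as in the commonly tabulated Janin scale.
TRANSFER_FREE_ENERGY = {
    "A": 0.3, "R": -1.4, "N": -0.5, "D": -0.6, "C": 0.9,
    "Q": -0.7, "E": -0.7, "G": 0.3, "H": -0.1, "I": 0.7,
    "L": 0.5, "K": -1.8, "M": 0.4, "F": 0.5, "P": -0.3,
    "S": -0.1, "T": -0.2, "W": 0.3, "Y": -0.4, "V": 0.6,
}


def diff_free_energy(wild_type: str, mutant: str) -> float:
    """Transfer free energy of the wild type minus that of the mutant."""
    diff = TRANSFER_FREE_ENERGY[wild_type] - TRANSFER_FREE_ENERGY[mutant]
    # The scale has one decimal place, so rounding removes float noise.
    return round(diff, 1)


# Alanine to aspartate: 0.3 - (-0.6)
ala_to_asp = diff_free_energy("A", "D")
```

Swapping the two residues flips the sign of the feature, which is why the distribution in Figure 2.6 extends on both sides of zero.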

     A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A    4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R   -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N   -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D   -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C    0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q   -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E   -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G    0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H   -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I   -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L   -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K   -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M   -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F   -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P   -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S    1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T    0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W   -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y   -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V    0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B   -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z   -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X    0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
*   -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1

Figure 2.1: BLOSUM62 amino acid substitution scores (Henikoff and Henikoff, 1992).
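
A whitespace-separated matrix such as the one in Figure 2.1 can be read into a lookup table with a few lines of code. The sketch below is only an illustration: the names `BLOSUM62_EXCERPT` and `parse_matrix` are hypothetical, and the embedded text is an excerpt holding the first four rows and columns of BLOSUM62 rather than the full matrix.

```python
# Excerpt of BLOSUM62 (rows and columns A, R, N, D only), in the usual
# layout: a header row of column labels, then one labelled row per residue.
BLOSUM62_EXCERPT = """
    A  R  N  D
A   4 -1 -2 -2
R  -1  5  0 -2
N  -2  0  6  1
D  -2 -2  1  6
"""


def parse_matrix(text: str) -> dict:
    """Parse a labelled, whitespace-separated score matrix.

    Returns a nested dictionary so that matrix[wild_type][mutant] is the
    substitution score of that pair.
    """
    rows = [line.split() for line in text.strip().splitlines()]
    columns = rows[0]
    return {
        row[0]: dict(zip(columns, (int(value) for value in row[1:])))
        for row in rows[1:]
    }


matrix = parse_matrix(BLOSUM62_EXCERPT)
asn_to_asp = matrix["N"]["D"]  # score of the pair N, D in the excerpt
```

Because the matrix is symmetric, `matrix["N"]["D"]` and `matrix["D"]["N"]` give the same score, so the order of wild type and mutant does not matter for this feature.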

2.4 Results and Discussion

2.4.1 Distribution of position specific features

2.4.1.1 P_ab Scores

It should be noted that the scores P_ab and G_ab for a given amino acid residue a at position b essentially indicate its preference or otherwise in a protein. Preference is indicated by scores of P_ab > 0.05 and G_ab > 0.0, and non-preference is indicated by scores of P_ab < 0.05 and G_ab < 0.0. Accordingly, all wild-type (WT) amino acid residues as well as neutral mutations are expected to show scores corresponding to preferences, whereas disease mutations are expected to show scores corresponding to non-preferences.

The distribution of P_ab scores for disease mutations as well as neutral mutations from the HumVar dataset has been investigated (Figure 2.2(a)). Among the disease mutations, about 91% showed P_ab scores < 0.05, indicating the excellent utility of these position-specific scores for predicting disease mutations. However, in the case of neutral mutations, only 43% showed scores > 0.05. This indicated that not all the neutral mutations are preferred substitutions. To make sure that P_ab scores indicate preferred amino acid residues correctly, I have investigated the distribution of P_ab scores for the wild-type (WT) amino acid residues from the HumVar dataset (Figure 2.2(b)). More than 95% of the WTs show scores > 0.05. The fact that more than half (57%) of the neutral mutations of the HumVar dataset show scores < 0.05 (indicative of non-preference) indicates that these may be mutations that occur infrequently though their phenotypic effect is benign. This could also be due to annotation mistakes in the Swiss-Prot database from which the HumVar dataset was curated.

The distribution of differences between the P_ab scores of wild type (WT) and mutant type (MT) (Figure 2.2(c)) was also examined, and it showed two distinct (but overlapping) distributions for disease and benign mutations.

2.4.1.2 G_ab Scores

A G_ab score less than 0 indicates that residue a at position b is not preferred and hence is indicative of a pathogenic mutation. On the other hand, a score > 0.0 indicates preference and is indicative of a neutral mutation. From Figure 2.3(a), it can be seen that 77% of the disease mutations correspond to scores < 0.0 and about 50% of neutral mutations have scores > 0.0. Although this result indicates that the G_ab score is less discriminating than P_ab, it is still a good discriminator. For comparison purposes, the distribution of G_ab scores of wild-type amino acid residues was examined (Figure 2.3(b)). About 90% of them showed values > 0, further indicating the utility of the score. The differences between the G_ab scores of wild type and mutant type (Figure 2.3(c)) show distinct but overlapping distributions for disease and benign mutations.
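
The percentages quoted in this section are tallies of how many mutations in each class fall below a score threshold (0.05 for P_ab, 0.0 for G_ab). A minimal sketch of such a tally is given below. The function name `percent_below` is hypothetical and the scores are invented toy values, not HumVar data, so the printed figures bear no relation to the results reported above.

```python
def percent_below(scores, threshold):
    """Percentage of scores that lie strictly below the threshold."""
    below = sum(1 for score in scores if score < threshold)
    return 100.0 * below / len(scores)


# Invented P_ab-like scores for two small groups of mutations.
toy_scores = {
    "disease": [0.01, 0.02, 0.30, 0.00],
    "neutral": [0.20, 0.01, 0.40, 0.08],
}

P_AB_THRESHOLD = 0.05  # preference threshold for P_ab used in this chapter

summary = {
    label: percent_below(scores, P_AB_THRESHOLD)
    for label, scores in toy_scores.items()
}

for label, percentage in summary.items():
    print(f"{label}: {percentage:.1f}% of scores below {P_AB_THRESHOLD}")
```

The same function applies to G_ab scores by passing a threshold of 0.0; scores exactly equal to the threshold fall in neither the "below" nor the "above" group under this strict comparison.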

Figure 2.2(a): The distribution of the P_ab score of the mutated amino acid (MT) for disease and benign mutations in the HumVar dataset. [Bar chart; categories MT < 0.05 and MT > 0.05; disease: 91% and 9%; neutral: 57% and 43%.]

Figure 2.2(b): The distribution of the P_ab score of the wild-type amino acid (WT) for the HumVar dataset. [Bar chart of percentage; WT < 0.05: 3.2; WT > 0.05: 96.8.]

Figure 2.2(c): Distribution of the difference in P_ab(WT) and P_ab(MT) scores for missense mutations in the HumVar dataset. N = neutral and D = disease mutations. [Plot of percentage against diff P_ab (WT - MT).]

Figure 2.3(a): Distribution of G_ab scores for Disease (D) and Neutral (N) mutations. N = Neutral and D = Disease mutations. [Bar chart; categories MT < 0 and MT > 0; disease: 77% and 23%; neutral: 47% and 50%.]

Figure 2.3(b): Distribution of G_ab scores calculated for the wild-type (WT) amino acid residues in the HumVar dataset. [Bar chart of percentage; WT < 0: 10; WT > 0: 90.]

Figure 2.3(c): Distribution of the difference in G_ab(WT) and G_ab(MT) scores for the mutations in the HumVar dataset (N = neutral and D = disease mutations). [Plot of percentage against diff G_ab (WT - MT), from -9 to 9.]

2.4.2 Structure and Amino acid based features

2.4.2.1 Solvent accessibility status of the missense mutations

As already mentioned, the solvent accessibility values of the missense mutations were predicted using the ACCpro program (Cheng et al., 2005), available as a part of the SCRATCH package. If the predicted % accessible surface area was less than 10, the mutation was considered buried, and if it was more than 10, the mutation was considered exposed.

Figure 2.4 shows the distribution of the HumVar mutations into buried and exposed categories. About 77% of the neutral mutations are predicted to be exposed to solvent, indicating that mutations of exposed residues tend to have a neutral effect. Among disease mutations, about 52% are predicted to be buried and the remaining 48% are predicted to be exposed to solvent. One would have expected a majority of disease-causing mutations to be buried, as changes at buried positions would destabilize the protein structure, leading to pathogenic effects (Wang and Moult, 2001; Sunyaev et al., 2000; Gong and Blundell, 2010). However, the current analysis indicates that mutations at exposed positions too can lead to pathogenic effects (Kowarsch et al., 2010). This result is not entirely surprising, given that some of the surface positions could form parts of binding surfaces involved in intermolecular interactions, and hence mutations at such functionally important sites would invariably yield pathogenic effects.

Figure 2.4: The distribution of status of mutations as buried or exposed for the mutations in the HumVar dataset. [Bar chart of percentage against status of solvent accessibility (Buried, Exposed) for neutral and disease mutations.]

2.4.2.2 Secondary structural status of the missense mutations

The distribution of disease and neutral mutations in the three secondary structure states, viz., helices, strands and coil regions, is shown in Figure 2.5. As can be seen, disease as well as neutral mutations do show some localization tendency. For example, disease mutations are predicted to be localized more on helices and beta strands than neutral mutations. Comparatively, coil regions harbor more neutral mutations than disease mutations.

2.4.2.3 Amino acid based features

The difference in free energy transfer values between the wild type and the mutant was calculated for each entry in the HumVar dataset. As mentioned already, the free energy transfer values of the 20 amino acid residues (Table 2.2) were taken from Janin (1979). Although both disease and neutral mutations can be seen in every interval from -2.0 to +2.0 (Figure 2.6), their proportions vary along the entire spectrum. The proportion of disease mutations increases at every interval for differences greater than 0.2, indicating energetically unfavoured situations in those cases.

The distribution of BLOSUM62 substitution scores for both disease and neutral mutations (Figure 2.7) was also examined. Scores of < -1 and > +1 indicate non-preferred and preferred substitutions, respectively. Of the disease mutations, 65% correspond to scores < -1, whereas 60% of neutral mutations correspond to scores > +1. This result reconfirms the observations made in earlier reports (Cargill et al., 1999; Balasubramanian et al., 2005).

Figure 2.5: The distribution of secondary structure status for the HumVar dataset. [Bar chart of percentage against status of secondary structure (Helix, Strand, Rest) for neutral and disease mutations.]

Figure 2.6: The distribution of the difference (WT - MT) in transfer free energy values, from inside to surface of the protein, for neutral and disease mutations. [Plot of percentage against the free energy difference, from -2 to 2.]

Figure 2.7: The BLOSUM62 substitution score distribution for disease and neutral mutations. [Bar chart of percentage; categories < -1 and > 1.]

2.5 Summary

In this chapter, I have investigated some sequence- and structure-based features of disease and benign mutations that are to be used in the SVM-based method. I have used a new set of 10 features, referred to as neutral-disease missense mutation discriminatory (NDMSMD) features. The ten NDMSMD features include six position-specific features, two protein-structure-based features and two amino-acid-residue-based features. This study also revealed the discriminatory power of all ten individual features to be used in the prediction of pathogenic mutations. The next chapter deals with the usage of the 10 NDMSMD features in the newly developed SVM-based method for the prediction of pathogenic mutations.