Protein Secondary Structure Prediction Doug Brutlag & Scott C. Schmidler
Overview
  Goals and problem definition
  Existing approaches
  Classic methods
  Recent successful approaches
  Evaluating prediction algorithms
  Shortcomings of existing approaches
  Current research
Protein structure prediction Sequence of 984 amino acids: PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKIG PENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKK KKSVTVLDVGDAYFSVPLDEDFRKYTAFTIPSINNETPGIRYQYNVLPQGW KGSPAIFQSSMTKILEPFKKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEE LRQHLLRWGLTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVN DIQKLVGKLNWASQIYPGIKVRQLCKLLRGTKALTEVIPLTEEAELELAEN REILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQEPFKNLKTGKYARM RGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPIQKETWETWWTEYWQA TWIPEWEFVNTPPLVKLWYQLEKEPIVGAETFYVDGAANRETKLGKAGYVT NKGRQKVVPLTNTTNQKTELQAIYLALQDSGLEVNIVTDSQYALGIIQAQP DKSESELVNQIIEQLIKKEKVYLAWVPAHKGIGGNEQVDKLVSAGI PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKIG PENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKK KKSVTVLDVGDAYFSVPLDEDFRKYTAFTIPSINNETPGIRYQYNVLPQGW KGSPAIFQSSMTKILEPFKKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEE LRQHLLRWGLTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVN DIQKLVGKLNWASQIYPGIKVKQLCKLLRGTKALTEVIPLTEEAELELAEN REILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQEPFKNLKTGKYARM RGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPIQKETWETWWTEYWQA TWIPEWEFVNTPPLVKLWYQ HIV reverse transcriptase 3D coordinates of 7404 atoms:
Abstracting the problem
  3D coords of all atoms
  3D coords of C-α backbone
  3D coords of secondary structure elements
  [Figures: three views of the same structure at decreasing levels of detail; label: C-α groups]
Secondary structure prediction for protein folding
  Sequence of amino acids (the 984-residue HIV reverse transcriptase sequence from the "Protein structure prediction" slide)
  Predict structural segments
  Goal: Recover 3D coords

The secondary structure prediction problem
  Given a protein sequence: NWVLSTAADMQGVVTDGMASGLDKD...
  Predict a secondary structure sequence: LLEEEELLLLHHHHHHHHHHLHHHL...
  H = α-helix, E = Extended β-strand, L = Loop/coil
Defining the secondary structure of a protein sequence
  α-helix and anti-parallel β sheet
  Residue Sequence:    NWVLSTAADMQGVVTDGMASFLDKD...
  Secondary Structure: LLEEEELLLLHHHHHHHHHHLHHHL...
  Fig. 1: Syntactic formulation of secondary structure problem
Abstracted version of protein structure prediction
  Sequence of 984 amino acids (the HIV reverse transcriptase sequence from the "Protein structure prediction" slide)
  3D coords of 179 structural elements
The Secondary Structure Prediction problem
  Given a protein sequence: NWVLSTAADMQGVVTDGMASGLDKD...
  Predict a secondary structure sequence: LLEEEELLLLHHHHHHHHHHLHHHL...
  3-state problem: $\{A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V\}^n \to \{L,H,E\}^n$
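Chou-Fasman method (Chou & Fasman, 1974)
  Calculate the propensity of each amino acid A to be in each structure S (helix, strand, coil):
    $\frac{P(A \mid S)}{P(A)} = \frac{n_{A,S} / n_S}{n_A / n}$
    where $n_{A,S}$ counts residues of type A observed in structure S, $n_S$ residues in structure S, $n_A$ residues of type A, and $n$ all residues
  Classify by propensity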
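The propensity calculation can be sketched from raw counts. This is a minimal illustration, not the original Chou-Fasman tables: the function name `propensities` and the two toy training pairs are assumptions made up for the example.

```python
from collections import Counter

def propensities(examples):
    """Estimate P(A|S)/P(A) = (n_AS / n_S) / (n_A / n) from (sequence, structure) pairs."""
    n_a, n_s, n_as = Counter(), Counter(), Counter()
    for seq, ss in examples:
        if len(seq) != len(ss):
            raise ValueError("sequence and structure must have equal length")
        for aa, state in zip(seq, ss):
            n_a[aa] += 1           # occurrences of residue type A
            n_s[state] += 1        # residues observed in structure S
            n_as[aa, state] += 1   # residue type A observed in structure S
    n = sum(n_a.values())
    return {
        (aa, state): (n_as[aa, state] / n_s[state]) / (n_a[aa] / n)
        for aa in n_a
        for state in n_s
    }

# Hypothetical toy data, for illustration only.
training = [("AAGA", "HHLH"), ("GGAG", "LLHL")]
prop = propensities(training)
```

A value above 1 means the residue is over-represented in that structure relative to its overall frequency; a value below 1 means it is under-represented.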
Chou-Fasman prediction
  Search for nucleation sites
    helix nucleation: score > 4 in a window of 6
  Propagate until termination criteria are met
    helix termination: tetrapeptide with mean propensity < 1
  Rules for conflict resolution, exceptions
  Accuracy: 70-80% reported (< 55% in the evaluation by Nishikawa, 83)
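The nucleate-then-extend idea can be sketched as follows. The slide's score definition did not survive, so the score used here (the number of residues in the window with helix propensity above 1) is a stand-in assumption, as are the function name `predict_helix` and the toy propensity values. The conflict-resolution rules are not modeled.

```python
def predict_helix(seq, helix_prop, window=6, nucleation_score=4, stop_mean=1.0):
    """Mark helix positions by nucleation followed by extension in both directions."""
    p = [helix_prop.get(aa, 1.0) for aa in seq]
    in_helix = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        # Stand-in score (an assumption): count of helix-favouring residues.
        score = sum(1 for x in p[start:start + window] if x > 1.0)
        if score <= nucleation_score:
            continue
        left, right = start, start + window  # half-open interval [left, right)
        # Extend right while the tetrapeptide ending at the new residue has mean >= stop_mean.
        while right < len(seq) and sum(p[right - 3:right + 1]) / 4 >= stop_mean:
            right += 1
        # Extend left while the tetrapeptide starting at the new residue has mean >= stop_mean.
        while left > 0 and sum(p[left - 1:left + 3]) / 4 >= stop_mean:
            left -= 1
        for i in range(left, right):
            in_helix[i] = True
    return "".join("H" if h else "-" for h in in_helix)

# Hypothetical two-letter propensity table, for illustration only.
toy_prop = {"A": 1.4, "G": 0.5}
example = predict_helix("GGGGAAAAAAGGGG", toy_prop)
```

In this toy run the helix spills one residue into each flanking G run, because a tetrapeptide with three strong formers still averages above 1.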
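GOR (Garnier et al., 1978)
  Calculate information content for each amino acid:
    $I(S; R) = \log \frac{P(S \mid R)}{P(S)}$   (the log of the CF propensity)
  Information difference, with $\bar{S}$ denoting "not S":
    $I(\Delta S; R) = I(S; R) - I(\bar{S}; R) = \log \frac{P(S, R)}{P(\bar{S}, R)} + \log \frac{P(\bar{S})}{P(S)}$
    (a log likelihood ratio, equal to $\log \frac{P(R \mid S)}{P(R \mid \bar{S})}$)
  Predict the state with maximum information difference using a window of size 17
  Accuracy: 64% (< 55% in the evaluation by Nishikawa, 83)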
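As a rough sketch of how the information difference could be estimated and summed over a 17-residue window, the code below counts residues at each offset from a labeled position. The function names, the pseudocount, and the single toy training pair are assumptions for illustration; this is not the published GOR parameter set. Training data must contain at least two states.

```python
import math
from collections import Counter

HALF = 8  # a window of size 17 spans 8 residues on each side

def train(examples):
    """Count (state at j, offset m, residue at j+m) triples and state totals."""
    joint, state_n = Counter(), Counter()
    for seq, ss in examples:
        for j, s in enumerate(ss):
            state_n[s] += 1
            for m in range(-HALF, HALF + 1):
                if 0 <= j + m < len(seq):
                    joint[s, m, seq[j + m]] += 1
    return joint, state_n

def info_diff(joint, state_n, s, m, aa, pseudocount=1.0):
    """log[P(S,R)/P(not S,R)] + log[P(not S)/P(S)] for one window offset m."""
    total = sum(state_n.values())
    in_s = joint[s, m, aa] + pseudocount
    not_s = sum(joint[t, m, aa] for t in state_n if t != s) + pseudocount
    return math.log(in_s / not_s) + math.log((total - state_n[s]) / state_n[s])

def predict(seq, joint, state_n):
    """Pick, at each position, the state with the largest summed information difference."""
    out = []
    for j in range(len(seq)):
        def total_info(s):
            return sum(
                info_diff(joint, state_n, s, m, seq[j + m])
                for m in range(-HALF, HALF + 1)
                if 0 <= j + m < len(seq)
            )
        out.append(max(state_n, key=total_info))
    return "".join(out)

# Hypothetical toy training pair, for illustration only.
joint, state_n = train([("AAAAAAGGGGGG", "HHHHHHLLLLLL")])
```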
Window-based prediction
  For each position in a protein sequence: ...LSTAADMQGVVTDGMASGLDKD...
  Predict its secondary structure based on a local window
  Slide the window along the sequence
  [Figure: successive windows moving one residue at a time along the sequence]
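The sliding window itself is simple to sketch. The helper name `windows`, the padding character, and the default half-width are illustrative assumptions; the slides do not say how sequence ends are handled.

```python
def windows(seq, half_width=8, pad="-"):
    """Yield (position, window) pairs; windows are padded at the sequence ends."""
    padded = pad * half_width + seq + pad * half_width
    for i in range(len(seq)):
        # The window is centred on residue i of the original sequence.
        yield i, padded[i:i + 2 * half_width + 1]

# Windows of 9 residues over the fragment shown on the slide.
fragment_windows = list(windows("LSTAADMQGVVTDGMASGLDKD", half_width=4))
```

Each window would then be passed to a classifier that predicts the state of its central residue.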
Modeling structural correlations
  Conditional independence models (naive-Bayes classifier):
    class in {L, H, E}, with amino acid distributions P(A | H), P(A | E), P(A | L) over A R N D C Q E G H I L K M F P S T W Y V
    window positions i-4 ... i+4
    [Figure: helix, strand and loop models]
  Pair-wise dependence:
    amino acid pair tables between position i and positions i+1 ... i+4 (Klinger: structural correlations)
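A conditional independence (naive Bayes) classifier over a window can be sketched as below. The class name `WindowNaiveBayes`, the add-alpha smoothing, and the toy training pair are assumptions for illustration; the pair-wise dependence model is not covered.

```python
import math
from collections import Counter, defaultdict

AMINO = "ARNDCQEGHILKMFPSTWYV"

class WindowNaiveBayes:
    def __init__(self, half_width=4, alpha=1.0):
        self.half_width, self.alpha = half_width, alpha
        self.class_n = Counter()
        self.counts = defaultdict(Counter)  # (class, offset) -> residue counts

    def _window(self, seq, i):
        for m in range(-self.half_width, self.half_width + 1):
            if 0 <= i + m < len(seq):
                yield m, seq[i + m]

    def fit(self, examples):
        for seq, ss in examples:
            for i, s in enumerate(ss):
                self.class_n[s] += 1
                for m, aa in self._window(seq, i):
                    self.counts[s, m][aa] += 1
        return self

    def log_posterior(self, seq, i):
        """Unnormalised log P(class | window), window positions treated as independent."""
        total = sum(self.class_n.values())
        scores = {}
        for s, n in self.class_n.items():
            lp = math.log(n / total)
            for m, aa in self._window(seq, i):
                c = self.counts[s, m]
                denom = sum(c.values()) + self.alpha * len(AMINO)
                lp += math.log((c[aa] + self.alpha) / denom)
            scores[s] = lp
        return scores

    def predict(self, seq):
        out = []
        for i in range(len(seq)):
            scores = self.log_posterior(seq, i)
            out.append(max(scores, key=scores.get))
        return "".join(out)

# Hypothetical toy data; half_width=0 uses only the residue itself.
model = WindowNaiveBayes(half_width=0).fit([("AAAAGGGG", "HHHHLLLL")])
```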
Hydrolase (β-lactamase) with amphipathic α-helix
Amphipathic α-helix: hydrophobic side chains
Amphipathic α-helix: side chain periodicity Sequence: NLAKMVVKTAEAILKD
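One way to quantify side chain periodicity is a hydrophobic moment: place each residue at 100 degrees from its predecessor (3.6 residues per α-helical turn) and sum the hydrophobicity vectors. This sketch is an assumption-laden illustration: the Kyte-Doolittle scale and the function name `hydrophobic_moment` are choices made here, not taken from the slides.

```python
import math

# Kyte-Doolittle hydropathy values (a scale chosen for this sketch).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydrophobic_moment(seq, degrees_per_residue=100.0):
    """Length of the summed hydrophobicity vectors on a helical wheel."""
    x = sum(KD[aa] * math.cos(math.radians(degrees_per_residue * k))
            for k, aa in enumerate(seq))
    y = sum(KD[aa] * math.sin(math.radians(degrees_per_residue * k))
            for k, aa in enumerate(seq))
    return math.hypot(x, y)

helix_periodicity = hydrophobic_moment("NLAKMVVKTAEAILKD", 100.0)
alternating_periodicity = hydrophobic_moment("NLAKMVVKTAEAILKD", 180.0)
```

For the sequence on the slide, the moment at 100 degrees per residue is much larger than at 180 degrees (strict alternation), which is the signature of an amphipathic helix.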
Structural correlations in β-strands
  [Figure: plot with axis ticks from 5 to -4; strand positions labeled E]
PHD (Rost & Sander, 1993)
  Neural network based: multi-layer perceptron (fully connected)
  2 levels:
    Sequence -> Structure
    Structure -> Structure
  Uses multiple sequence alignment
    Amino acid frequencies
    Conservation weight
  Post-processing by dynamic programming
  [Figure: network reading window positions i-4 ... i+4 of the amino acid sequence and producing structure predictions H, L, E]
Special case: helical transmembrane proteins
  Membrane proteins biologically important
  Difficult to determine experimentally
  Easier to predict
    Constraints imposed by lipid bilayer
    Strong hydrophobicity signal
    Cytoplasmic residues positively charged
  2-state
  Accuracy: 95% (Multiple alignment)
  [Figure: bacteriorhodopsin]
Predator (Frishman & Argos, 1996) Nearest-neighbor classifier Represent subsequence as a vector Define a distance metric Find closest vectors in training set, vote Adds non-local terms for hydrogen bonding propensity (helices, sheets) Accuracy: 68% single sequence; 75% multiple alignment
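The nearest-neighbor steps (vector representation, distance metric, vote) can be sketched generically. This is not Predator's actual encoding or metric, and the hydrogen bonding terms are omitted: the one-hot encoding, squared Euclidean distance, function names, and toy training windows are all assumptions for illustration.

```python
from collections import Counter

AMINO = "ARNDCQEGHILKMFPSTWYV"

def encode(window):
    """One-hot encode a window; unknown characters (e.g. padding) become all zeros."""
    vec = []
    for aa in window:
        vec.extend(1.0 if aa == a else 0.0 for a in AMINO)
    return vec

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def knn_predict(window, training, k=3):
    """training: list of (window, state) pairs with windows of equal length."""
    query = encode(window)
    nearest = sorted(
        training, key=lambda item: squared_distance(query, encode(item[0]))
    )[:k]
    votes = Counter(state for _, state in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled windows, for illustration only.
toy_training = [("AAA", "H"), ("AAL", "H"), ("GGG", "L"), ("GPG", "L"), ("VIV", "E")]
```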
SSPAL (Salamov & Solovyev, 1997) Nearest-neighbor can be viewed as fixed length, non-gapped local alignment Find K (= 50) best non-overlapping local alignments with known structures Predict each position by consensus of alignment structures, weighted by score Accuracy: Single sequence: 71.2% Multiple sequence alignment: 73.5%
Evaluation of secondary structure prediction Large database of protein sequences: Known structures X-ray crystallography, NMR Gold-standard assignment Non-homologous < 25-30% identity Cross-validation
What to measure? Q3 (3-state accuracy) percent residues correct Matthews correlation coefficient adjust for prevalence Segment-based measures Rost et al 94, Taylor 84, Presnell et al 92
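The two per-residue measures can be written down directly. The function names are illustrative; the segment-based measures cited on the slide are not sketched here.

```python
import math

def q3(true_ss, pred_ss):
    """Percentage of residues whose predicted state matches the true state."""
    if len(true_ss) != len(pred_ss) or not true_ss:
        raise ValueError("need two non-empty strings of equal length")
    return 100.0 * sum(t == p for t, p in zip(true_ss, pred_ss)) / len(true_ss)

def matthews(true_ss, pred_ss, state):
    """Matthews correlation coefficient for one state against all others."""
    tp = fp = tn = fn = 0
    for t, p in zip(true_ss, pred_ss):
        if p == state and t == state:
            tp += 1
        elif p == state:
            fp += 1
        elif t == state:
            fn += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0 when the coefficient is undefined (a row or column of zeros).
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Q3 rewards predicting the most common state everywhere, whereas the Matthews coefficient is 0 for such a constant prediction, which is why it is described as adjusting for prevalence.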
Bayesian Segmentation of Protein Secondary Structure
  Scott C. Schmidler (1,3), Jun S. Liu (2), Douglas L. Brutlag (3)
  (1) Section on Medical Informatics, (2) Department of Statistics, (3) Department of Biochemistry
  Stanford University
Goals Improved secondary structure prediction > 70% accuracy (75% MSA) accurate estimates of prediction variability Combining structural data with scientific knowledge
Bayesian structure prediction
  Model-based structure prediction
    Probabilistic modeling of segments
      Hydrophobicity patterns
      Side chain interactions
      Helical capping
  Predict structure to maximize probability
    Optimal segmentation of protein sequence
  Example segmentation: ... L: NW | E: VLST | L: AADM | H: QGVVTDGMAS | L: F | H: LDK | L: D ...
  Doug Brutlag, 2000
Model
  R: residue sequence; S: segment end positions; T: segment types; m: number of segments
  Joint distribution, with a conditional independence model for inter-segment residues:
    $P(R, S, T) = P(S, T) \prod_{j=1}^{m} P(R_{[S_{j-1}+1:S_j]} \mid S, T)$
  Markovian dependence in S, T:
    $P(R, S, T) = \prod_{j=1}^{m} P(T_j \mid T_{j-1}) \, P(S_j \mid S_{j-1}, T_j) \, P(R_{[S_{j-1}+1:S_j]} \mid S_{j-1}, S_j, T_j)$
  Example:
    T  = L E L H L H L
    R  = NWVLSTAADMQGVVTDGMASFLDKD
    SS = LLEEEELLLLHHHHHHHHHHLHHHL
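To make the factorization concrete, the sketch below evaluates log P(R, S, T) for a given segmentation as a sum over segments of a type transition term, a length term, and a segment likelihood. It simplifies heavily: the length term depends only on segment length, residues within a segment are treated as independent, and all parameter tables and names are hypothetical toy values.

```python
import math

def log_joint(seq, ends, types, trans, length_prob, emit, start="START"):
    """log P(R, S, T) for segment end positions S (1-based, inclusive) and types T."""
    if len(ends) != len(types) or not ends or ends[-1] != len(seq):
        raise ValueError("segmentation must cover the whole sequence")
    total, prev_end, prev_type = 0.0, 0, start
    for end, t in zip(ends, types):
        segment = seq[prev_end:end]
        total += math.log(trans[prev_type][t])           # P(T_j | T_{j-1})
        total += math.log(length_prob[t][len(segment)])  # P(S_j | S_{j-1}, T_j)
        # Segment likelihood, simplified here to independent residues.
        total += sum(math.log(emit[t][aa]) for aa in segment)
        prev_end, prev_type = end, t
    return total

# Hypothetical toy parameters, for illustration only.
trans = {"START": {"H": 0.5, "L": 0.5}, "H": {"L": 1.0}, "L": {"H": 1.0}}
length_prob = {"H": {1: 0.2, 2: 0.5, 3: 0.2, 4: 0.1},
               "L": {1: 0.2, 2: 0.5, 3: 0.2, 4: 0.1}}
emit = {"H": {"A": 0.8, "G": 0.2}, "L": {"A": 0.2, "G": 0.8}}
toy_log_joint = log_joint("AAGG", [2, 4], ["H", "L"], trans, length_prob, emit)
```

In this notation the slide's example segmentation has end positions 2, 6, 10, 20, 21, 24, 25 with types L E L H L H L.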
Position-specific preferences
  [Figure: four bar charts of amino acid frequency (vertical axis 0 to 0.18) against amino acid (A R N D C Q E G H I L K M F P S T W Y V):
   Helix N-cap model, positions 1&2; Helix internal position model; Strand internal position model; Loop/coil N-cap model, positions 1&2]
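Segment likelihood
  Modeling correlations among properties ($H_i$ is the property variable associated with residue $R_{[i]}$):
    $P(R_{[k:j]} \mid S_{q-1}, S_q, \text{Helix}) = \prod_{i=k}^{j} P_{\text{Helix}}(R_{[i]} \mid H_i) \, P_{\text{Helix}}(H_i \mid H_{i-2}, H_{i-3}, H_{i-4}, H_{i-7})$
    $P(R_{[k:j]} \mid S_{q-1}, S_q, \text{Strand}) = \prod_{i=k}^{j} P_{\text{Strand}}(R_{[i]} \mid H_i) \, P_{\text{Strand}}(H_i \mid H_{i-2}, H_{i-3})$
    $P(R_{[k:j]} \mid S_{q-1}, S_q, \text{Loop}) = \prod_{i=k}^{j} P_{\text{Loop}}(R_{[i]} \mid H_i) \, P_{\text{Loop}}(H_i \mid H_{i-2})$
  [Figure: helix model, with property nodes H1 ... H6 above residue nodes R1 ... R6]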
Segment length distributions
  [Figure: probability (0 to 0.2) against segment length, for Helix and Strand]
Probabilistic model: P(Sequence | Structure) P(Structure)
  Structure- and position-specific frequencies
  Segment length priors
    [Figure: probability (0 to 0.2) against segment length, for Helix and Strand]
  Conditional independence of inter-segment residues
  Markovian dependence in segment types
    [Figure: example segment type sequences L E L H and L H L]
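With segment types, length priors, and per-segment likelihoods in hand, a maximum-probability segmentation can be found by dynamic programming over segment end positions. The sketch below is a generic semi-Markov Viterbi recursion under simplifying assumptions (independent residues within a segment, hypothetical toy parameters, illustrative names); it is not presented as the authors' algorithm.

```python
import math

NEG_INF = float("-inf")

def _log(p):
    return math.log(p) if p > 0 else NEG_INF

def map_segmentation(seq, types, trans, length_prob, emit, start="START", max_len=30):
    """Return (log probability, [(type, begin, end)]) of the best segmentation."""
    n = len(seq)
    # best[e][t]: best log probability of seq[:e] when the last segment has type t
    best = [{t: NEG_INF for t in types} for _ in range(n + 1)]
    back = [{t: None for t in types} for _ in range(n + 1)]
    for e in range(1, n + 1):
        for t in types:
            seg_ll = 0.0
            for b in range(e - 1, max(0, e - max_len) - 1, -1):  # segment seq[b:e]
                seg_ll += _log(emit[t].get(seq[b], 0.0))
                ll = seg_ll + _log(length_prob[t].get(e - b, 0.0))
                if b == 0:
                    cands = [(_log(trans[start].get(t, 0.0)), None)]
                else:
                    cands = [(best[b][u] + _log(trans[u].get(t, 0.0)), u) for u in types]
                for prev_score, u in cands:
                    if prev_score + ll > best[e][t]:
                        best[e][t] = prev_score + ll
                        back[e][t] = (b, u)
    t = max(types, key=lambda u: best[n][u])
    score = best[n][t]
    if score == NEG_INF:
        raise ValueError("no segmentation has non-zero probability")
    segments, e = [], n
    while e > 0:  # follow back-pointers from the end of the sequence
        b, u = back[e][t]
        segments.append((t, b, e))
        e, t = b, u
    return score, segments[::-1]

# Hypothetical toy parameters, for illustration only.
trans = {"START": {"H": 0.5, "L": 0.5}, "H": {"L": 1.0}, "L": {"H": 1.0}}
length_prob = {"H": {1: 0.2, 2: 0.5, 3: 0.2, 4: 0.1},
               "L": {1: 0.2, 2: 0.5, 3: 0.2, 4: 0.1}}
emit = {"H": {"A": 0.8, "G": 0.2}, "L": {"A": 0.2, "G": 0.8}}
best_score, best_segments = map_segmentation("AAGG", ["H", "L"], trans, length_prob, emit)
```

Segments are reported as half-open index ranges, so `("H", 0, 2)` covers the first two residues.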
Benefits Explicit probabilistic model: Sound semantics Coherent treatment of uncertainty/noisy data Explicit, testable assumptions Fully Bayesian prediction: Averaged over all possible models Accurate estimates of uncertainty
Example prediction: Cytochrome C5 (1cc5) True: Predicted:
Prediction confidence
  [Figure: three plots against prediction threshold (0.4 to 0.9); series %H, %E, %L]
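The plotted quantities are not recoverable from the extracted slide, but one common way to relate confidence to correctness is to keep only residues whose predicted probability reaches a threshold and then report accuracy and coverage. The sketch below assumes a per-residue confidence (for example, the posterior probability of the predicted state) is available; the function name and inputs are hypothetical.

```python
def accuracy_by_threshold(true_ss, pred_ss, confidence, thresholds=(0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Return (threshold, accuracy %, coverage %) for residues with confidence >= threshold."""
    if not (len(true_ss) == len(pred_ss) == len(confidence)) or not true_ss:
        raise ValueError("inputs must be non-empty and of equal length")
    rows = []
    for th in thresholds:
        kept = [(t, p) for t, p, c in zip(true_ss, pred_ss, confidence) if c >= th]
        coverage = 100.0 * len(kept) / len(true_ss)
        # Accuracy is undefined when no residue passes the threshold.
        accuracy = 100.0 * sum(t == p for t, p in kept) / len(kept) if kept else float("nan")
        rows.append((th, accuracy, coverage))
    return rows
```

If the probability estimates are well calibrated, accuracy should rise as the threshold rises, at the cost of coverage.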
Evaluation
  453 structures selected from the Brookhaven Protein Data Bank
    < 2.5 Å resolution
    < 25% sequence similarity
  Cross-validation results (single sequence):
    Bayesian Segmentation algorithm:
      Marginal mode prediction: 68.8%
      MAP segmentation prediction: 64.2%
    Previous best-published:
      68% (Frishman & Argos 96)
      71% (Salamov & Solovyev 97)
Beyond secondary structure: predicting 3D contacts Model captures local dependencies Structure-specific residue propensities Intra-segment side chain correlations Tertiary interactions β-sheets Coiled-coils Disulfide bonds
β-sheet side chain correlations
  Odds ratio: $\frac{P(A_i, A_j \mid \text{Struct})}{P(A_i \mid \text{Struct}) \, P(A_j \mid \text{Struct})}$
  Charged-pair interaction in Glutaredoxin
  Disulfide bonds
  Stabilizing pairs from (Smith & Regan, Science 1995)
  Hydrophobic side chains
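The odds ratio can be estimated from a list of residue pairs observed at interacting positions within one structure class. The function name and the toy pair list are assumptions for illustration, not data from the study.

```python
from collections import Counter

def pair_odds_ratios(pairs):
    """Odds ratio P(Ai, Aj) / (P(Ai) P(Aj)) for residue pairs from one structure class."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one observed pair")
    n = len(pairs)
    joint = Counter(pairs)
    first = Counter(a for a, _ in pairs)    # marginal counts at position i
    second = Counter(b for _, b in pairs)   # marginal counts at position j
    return {
        (a, b): (joint[a, b] / n) / ((first[a] / n) * (second[b] / n))
        for (a, b) in joint
    }

# Hypothetical cross-strand pairs, for illustration only.
toy_pairs = [("K", "E")] * 3 + [("V", "I")] * 3 + [("K", "I")] + [("V", "E")]
ratios = pair_odds_ratios(toy_pairs)
```

A ratio above 1 indicates a pair seen more often than independence would predict; a ratio below 1 indicates a pair seen less often.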
Incorporating non-local information
  Segment interaction models
    Replace terms $P(\text{Segment}_j)$ and $P(\text{Segment}_k)$ with $P(\text{Segment}_j, \text{Segment}_k)$
    but computation...
  [Figure: parallel β-sheet in 1nzy, with interacting β segments marked on the segmentation]
Prediction of β-strand Contact Map for 5pti Predicted contacts: True contacts: Pairing and register of β-hairpin correctly predicted
Previous Approaches: Hubbard (1995) Predicted contacts (single sequence): Predicted contacts (multiple sequence alignment):
Flavodoxin (5nul) Predicted contacts: True contacts: β-strands well-predicted but poor specificity in strand pairing
Future work Models for subclasses of segments Environment: amphipathic/buried/exposed Structure: 3-10 helices, β, γ turns Model selection Multiple sequence alignment information
Conclusions Probabilistic modeling of protein structure Prediction by segmentation of sequence Independent segment models perform comparably to existing approaches General framework for modeling non-local interactions to predict 3D contacts