Jessica Wehner. Summer Fellow Bioengineering and Bioinformatics Summer Institute University of Pittsburgh 29 May 2008


1 Journal Club
Jessica Wehner, Summer Fellow
Bioengineering and Bioinformatics Summer Institute, University of Pittsburgh
29 May 2008

2 Comparison of Probabilistic Combination Methods for Protein Secondary Structure Prediction
Yan Liu, Jaime Carbonell, Judith Klein-Seetharaman and Vanathi Gopalakrishnan
Bioinformatics, Vol. 20

3 Primary Topics of the Paper
- Protein secondary structure prediction
- Probabilistic combination of different predictions
- Assessment of different methods


4 Secondary Structure Prediction
Predict the label of each residue to be one of H (helix), E (strand) or C (coil).

Residues x_1 ... x_22: MAGKPVLMYNDGRGRMFPVRWL
Labels   l_1 ... l_22: HHCCEEEECCEEEECCHHHCCC

Prediction is:
- cheaper and faster than experimental methods
- applicable to all proteins
- helpful in the design of therapeutic drugs

5 Initial Prediction Outputs
Output is one of:
- Labels: one predicted label (H, E or C) for each residue x_1 ... x_10
- Score matrix: a score for each of H, E and C at each residue (values shown on the slide include 0.9, 0.6 and 0.1)

Typical initial prediction algorithms include PSI-BLAST, neural networks, recurrent neural networks, Support Vector Machines (SVMs) and Hidden Markov Models (HMMs).
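To illustrate how the two output forms relate, a label sequence can be read off a score matrix by taking the highest-scoring class at each residue. The sketch below assumes one row per residue with columns ordered H, E, C; the function name and the scores are made up for illustration and are not taken from the paper.

```python
# Hypothetical layout: one row per residue, columns ordered (H, E, C).
CLASSES = "HEC"

def labels_from_scores(score_matrix):
    """Return one label per residue by picking the highest-scoring class."""
    labels = []
    for row in score_matrix:
        best = max(range(len(CLASSES)), key=lambda k: row[k])
        labels.append(CLASSES[best])
    return "".join(labels)

# Illustrative scores only.
example_scores = [
    [0.9, 0.05, 0.05],
    [0.6, 0.1, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.7, 0.1],
]
print(labels_from_scores(example_scores))  # prints HHCE
```

Going the other way is not possible: the labels discard how confident the predictor was, which is one reason the score matrix is the richer input for combination.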

6 Probabilistic Combination Methods
Window Based Methods
- Using a label sequence
- Using a score matrix

Graphical Chain Models (all of these use the score matrix)
- Maximum Entropy Markov Models (MEMM)
- Higher Order MEMMs (HOMEMM)
- Pseudo State Duration MEMMs (PSMEMM)
- Conditional Random Fields (CRF)

Example label sequence: CCCCHHEEECEEEEHHHCCC

7 Probabilistic Combination (Window Based)
With Labels
- Labels: l_1, l_2, l_3, ..., l_N
- Let w = 7. Then the window (centered on l_10) is <l_7, l_8, l_9, l_10, l_11, l_12, l_13>.
- A rule-based classifier is then applied (decision trees).

With Scores
- Have the score matrix.
- The window would be <[S_H(x_7), S_E(x_7), S_C(x_7)], ..., [S_H(x_13), S_E(x_13), S_C(x_13)]>.
- A more powerful classifier is then applied (neural networks, k-nearest-neighbor, SVM).
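The window construction can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the function names are hypothetical, and the handling of sequence ends (a `-` pad for labels, zero vectors for scores) is an assumption, since the slide does not say how the ends are treated.

```python
def label_window(labels, center, w=7, pad="-"):
    """Return the w labels centered on 0-based index `center`, padded at the ends."""
    half = w // 2
    window = []
    for j in range(center - half, center + half + 1):
        window.append(labels[j] if 0 <= j < len(labels) else pad)
    return window

def score_window(scores, center, w=7):
    """Flatten the w (H, E, C) score triples around `center` into one feature vector."""
    half = w // 2
    n_classes = len(scores[0])
    features = []
    for j in range(center - half, center + half + 1):
        # Assumption: positions outside the sequence contribute zero scores.
        features.extend(scores[j] if 0 <= j < len(scores) else [0.0] * n_classes)
    return features

labels = "HHCCEEEECCEEEECCHHHCCC"
# Residue 10 in the slide's 1-based numbering is index 9 here.
print(label_window(labels, 9))  # labels l_7 .. l_13
```

With w = 7 the label window has 7 entries and the score window has 21 (7 positions times 3 classes); each would be the input to the classifier that predicts the final label of the center residue.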

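8 Probabilistic Combination (Graphical Models)
- Define features: score features and transition features.
- Final labels are given by [equation not recovered].
- MEMMs: include information about the previous residue, y_{i-1}.
[The slide's equations for the score features, transition features and final labels were not recovered.]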
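For orientation, the textbook form of a first-order MEMM is sketched below. This is the standard definition, not a reconstruction of the slide's own equations; the feature functions $f_k$, weights $\lambda_k$ and local normalizer $Z$ are generic notation assumed here.

```latex
P(y_i \mid y_{i-1}, x)
  = \frac{\exp\left(\sum_k \lambda_k f_k(x, i, y_{i-1}, y_i)\right)}{Z(x, i, y_{i-1})},
\qquad
Z(x, i, y_{i-1}) = \sum_{y'} \exp\left(\sum_k \lambda_k f_k(x, i, y_{i-1}, y')\right)

\hat{y} = \arg\max_{y} \prod_{i=1}^{N} P(y_i \mid y_{i-1}, x)
```

The normalizer $Z(x, i, y_{i-1})$ is computed separately for each position and previous label, which is the per-state normalization that CRFs replace with a single global term.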

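9 Probabilistic Combination (Graphical Models)
HOMEMMs (2nd order):
- Include information about the two previous residues.
- The score feature f_j^score is the same.
[Transition feature equation not recovered.]

PSMEMMs:
- Avoid the computational cost of HOMEMMs.
- Use the observation that segment length varies across structure types.
- Now have [expression not recovered], learned from [expression not recovered].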
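The segment-length observation that motivates PSMEMMs can be checked directly on a label sequence. The sketch below is illustrative only: `segment_lengths` is a hypothetical helper, and it says nothing about how the paper itself encodes durations in the model.

```python
from collections import defaultdict
from itertools import groupby

def segment_lengths(labels):
    """Map each structure type to the lengths of its consecutive runs."""
    lengths = defaultdict(list)
    for state, run in groupby(labels):
        lengths[state].append(len(list(run)))
    return dict(lengths)

print(segment_lengths("HHCCEEEECCEEEECCHHHCCC"))
```

Collected over a whole dataset, these per-type length lists give a segment length distribution for each structure type.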

10 Probabilistic Combination (Graphical Models)
CRFs:
- Use a global normalization Z_0 to avoid the label bias problem.
- Score and transition features are similar to those of MEMMs.
- CRFs define [equations not recovered].
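To make the global normalization concrete, the sketch below computes Z_0 for a tiny linear-chain model in two ways: by summing over every label path, and by the forward recursion. It assumes the standard linear-chain CRF form, where a path's unnormalized log-score is the sum of per-residue scores and transition weights; the weights are invented for illustration and the function names are hypothetical.

```python
import itertools
import math

STATES = "HEC"

def path_score(scores, trans, path):
    """Unnormalized log-score of one label path: residue scores plus transitions."""
    total = scores[0][path[0]]
    for i in range(1, len(path)):
        total += trans[path[i - 1]][path[i]] + scores[i][path[i]]
    return total

def partition_brute_force(scores, trans):
    """Z_0 by summing exp(score) over every possible label path."""
    k = len(STATES)
    return sum(math.exp(path_score(scores, trans, p))
               for p in itertools.product(range(k), repeat=len(scores)))

def partition_forward(scores, trans):
    """Z_0 by the forward recursion, linear in the sequence length."""
    k = len(STATES)
    alpha = [math.exp(scores[0][s]) for s in range(k)]
    for i in range(1, len(scores)):
        alpha = [
            sum(alpha[p] * math.exp(trans[p][s]) for p in range(k))
            * math.exp(scores[i][s])
            for s in range(k)
        ]
    return sum(alpha)

# Illustrative weights only (rows and columns ordered H, E, C).
scores = [[1.0, 0.2, 0.1], [0.8, 0.3, 0.4], [0.1, 0.2, 1.2]]
trans = [[0.5, -1.0, 0.0], [-1.0, 0.5, 0.0], [0.0, 0.0, 0.3]]

z0 = partition_forward(scores, trans)
best = max(itertools.product(range(3), repeat=3),
           key=lambda p: path_score(scores, trans, p))
prob_best = math.exp(path_score(scores, trans, best)) / z0
print("".join(STATES[s] for s in best), prob_best)
```

Because a single Z_0 normalizes whole paths, probability mass is compared across complete label sequences rather than being renormalized at each step, which is the property credited with avoiding label bias.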

11 Analysis of Combination Methods
Dataset:
- A set of 513 protein chains of known structure
- Initial labels and scores generated using PSI-BLAST

Implementation:
- Window-based:
  - Decision tree algorithm for Label Combination
  - Support Vector Machines (SVMs) for Score Combination
- 4 Graphical Models (MEMM, HOMEMM, PSMEMM, CRF)

Evaluation measures:
- Average accuracy over all residues: Q_3
- Average accuracy over each structure type: Q_H, Q_E, Q_C
- Correlation between true and predicted labels of each structure type: C_H, C_E, C_C
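The evaluation measures can be sketched as below. Two assumptions are made here that the slide does not spell out: the per-type accuracy is taken as the fraction of residues truly in a state that are predicted as that state, and the correlation is taken as the Matthews correlation coefficient computed one-vs-rest. The predicted sequence is invented for illustration.

```python
import math

def q3(true, pred):
    """Fraction of residues whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)

def q_state(true, pred, state):
    """Fraction of residues truly in `state` that are predicted as `state`."""
    idx = [i for i, t in enumerate(true) if t == state]
    return sum(pred[i] == state for i in idx) / len(idx) if idx else float("nan")

def matthews(true, pred, state):
    """Matthews correlation coefficient for one structure type (one-vs-rest)."""
    pairs = list(zip(true, pred))
    tp = sum(t == state and p == state for t, p in pairs)
    tn = sum(t != state and p != state for t, p in pairs)
    fp = sum(t != state and p == state for t, p in pairs)
    fn = sum(t == state and p != state for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

true = "HHCCEEEECCEEEECCHHHCCC"
pred = "HHHCEEEECCEEECCCHHHCCC"  # hypothetical prediction with two errors

print(q3(true, pred))
print({s: q_state(true, pred, s) for s in "HEC"})
print({s: matthews(true, pred, s) for s in "HEC"})
```

On a real dataset these would be computed over all residues of all chains rather than a single short sequence.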

12 Results
Table 1. Analysis of combination methods.
Columns: Combination Method, Q_3, Q_H, Q_E, Q_C, C_H, C_E, C_C
Rows: None, Dtree, SVM, MEMM, HOMEMM, PSMEMM, CRF
[Numeric values were not recovered from the slide.]

- CRFs improved predictions for helices and sheets.
- Score Combination is better than Label Combination.
- Graphical model methods are better than window-based methods.

13 Insights and Observations
- Continued improvements in secondary structure prediction are valuable.
- Using scores rather than labels appears more effective and is not difficult, so it seems it should be commonly implemented.
- The improvement of graphical models over window-based methods comes with a trade-off, since graphical models are more computationally taxing.
- Initially the reported improvements did not seem statistically significant to me; however, it is likely that in this field even small gains are greatly advantageous.

14 Acknowledgements
- Dr. Judy Weiber
- Dr. Madhavi Ganapathiraju
- Bill Glassford
- Davis Buenger

15 Questions?

16 Figures
- Fig. 1 Output from PSI-BLAST.
- Fig. 2 Structure segment length distribution.
- Fig. 3 Decision tree.
