TMSEG Michael Bernhofer, Jonas Reeb pp1_tmseg

1 Title: TMSEG. Authors: Michael Bernhofer, Jonas Reeb. Short title: pp1_tmseg. Lecture: Protein Prediction 1 (for Computational Biology), Protein structure, TUM summer semester

2 Last time 2

4 Yet another transmembrane predictor? More data available. Re-training old methods is viable, but no one does it. Less extensive machine learning. Runtime. 4

5 Dataset: Transmembrane helices I. 166 membrane protein sequences (TMP166). TMH assignment from 3D structure by OPM & PDBTM. Assignments differ, both used for training. Map to UniProt sequence using SIFTS. Redundancy reduction with UniqueProt at HVAL>0. (Lomize et al., 2006, Bioinformatics; Kozma et al., 2013, NAR; Velankar et al., 2013, NAR; Mika et al., 2003, NAR) 5

6 Dataset Transmembrane helices II Inside/Outside topology assignment OPM Lomize et al., 2006, Bioinformatics 6

7 Dataset: Proteins w/ and w/o signal peptides (SP1441). Derived from the SignalP 4.0 training set. Redundancy reduced against the set of 166 TMPs at HVAL>0. Redundancy reduced within the set at HVAL>0. Soluble: 1142 (452 w/ SP). Membrane: 299 (25 w/ SP). 7

8 Dataset: Split. Split into 4 subsets, maintaining the distribution of TMPs, SPs and sequence lengths. Use 3 sets for cross-validation, keep one for final independent evaluation (Blind set). [Figure: the TMP and SP sets, each divided into Train and Blind parts] 8
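The split described on this slide can be sketched as a stratified round-robin assignment. This is a minimal illustration, not the authors' procedure: the function name `stratified_split`, the stratum labels, and the toy data are all assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(proteins, n_subsets=4, seed=0):
    """proteins: list of (protein_id, stratum_label) tuples.
    Returns n_subsets lists of ids, with every stratum spread evenly."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for pid, label in proteins:
        by_stratum[label].append(pid)
    subsets = [[] for _ in range(n_subsets)]
    offset = 0  # continue the round-robin across strata to balance sizes
    for label in sorted(by_stratum):
        ids = by_stratum[label]
        rng.shuffle(ids)
        for i, pid in enumerate(ids):
            subsets[(i + offset) % n_subsets].append(pid)
        offset += len(ids)
    return subsets

# Toy input (hypothetical): the label combines protein class and a length bin.
toy = [(f"P{i}", ("TMP" if i % 3 == 0 else "SP", "long" if i % 2 else "short"))
       for i in range(40)]
subsets = stratified_split(toy)
blind, train = subsets[0], subsets[1:]  # one blind set, three for cross-validation
```

Shuffling within each stratum before dealing the proteins out keeps the class and length distributions similar across the four subsets.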

9 Classification trees Given N training samples and M input features find the best recursive partitioning to predict the class labels in the leaf nodes Splitting, pruning, balancing... approaches differentiate algorithms 9

10 Classification trees example Loh, 2011, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10

11 Random forests. Ensemble method: grow T trees for a forest. For M input features, choose m < M. For each tree t of the T trees: select N training samples with replacement from all N samples. At every split, choose m random features and use the best split among those for building the tree. 11
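The two sources of randomness named on this slide, bootstrap sampling of the training set and a random feature subset at each split, can be sketched with the standard library alone. The helper names and the toy sizes below are illustrative assumptions; tree growing itself is left out.

```python
import random

def bootstrap_sample(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap per tree)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def candidate_features(n_features, m, rng):
    """Pick m distinct feature indices to evaluate at one split (m < M)."""
    return rng.sample(range(n_features), m)

rng = random.Random(42)
N, M, m, T = 200, 30, 5, 10  # toy sizes, not TMSEG's settings
forest_plan = []
for t in range(T):
    rows = bootstrap_sample(N, rng)            # training rows for tree t
    first_split = candidate_features(M, m, rng)  # features tried at the root
    forest_plan.append((rows, first_split))
```

In a full implementation `candidate_features` would be called again at every internal node, not only at the root.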

12 Random forests - Popularity. Fast. No black box. Intuitive to interpret. Good performance. (Jensen et al., 2011, Bioinformatics) 12

13 TMSEG step 1 Initial prediction Random Forest (T = 100, m = 9) Sliding window of 19 residues (w = 19) 3 scores for each residue (0-1000): Signal peptide Transmembrane helix Soluble Scores scaled from to
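A sliding window assigns each residue the features of its neighbourhood. The sketch below shows only the windowing step; padding positions beyond the termini with a placeholder symbol is an assumption of this example, since the slide does not say how the ends are handled.

```python
def sliding_windows(sequence, w=19, pad="-"):
    """Return one window of length w centred on each residue.
    Positions beyond the termini are filled with `pad` (assumption)."""
    if w % 2 == 0:
        raise ValueError("window size must be odd")
    half = w // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + w] for i in range(len(sequence))]

# Small window for readability; step 1 of TMSEG uses w = 19.
windows = sliding_windows("MGPRARPALLLL", w=5)
print(windows[0])   # --MGP
print(windows[-1])  # LLL--
```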

14 TMSEG overview Step 1 14

15 TMSEG step 1 - Feature set I Global features: Global amino acid composition Protein length Local features: PSSM score Distance to N- and C-terminus Average hydrophobicity (Kyte-Doolittle) % hydrophobic % charged (positive & negative) w = 9 % polar 15

16 TMSEG step 1 - Feature set II Adjusting for conservation Substitutions with score > 0 = 16 Substitutions with score < 0 = 79 16

17 TMSEG step 1 - Feature set III Adjusting for conservation Amino acid composition M (PSSM>0) = 1/16 17

18 TMSEG step 1 - Feature set IV Adjusting for conservation Amino acid composition M (PSSM>0) = 1/16 Amino acid composition M (PSSM<0) = 3/79 18

19 TMSEG step 1 - Feature set V Adjusting for conservation % positive charge (PSSM>0) = 2/16 % positive charge (PSSM<0) = 8/79 19
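The conservation adjustment on slides 16 to 19 can be read as follows: count the PSSM cells in the window with a positive score and those with a negative score, then express each amino acid's share of either group (for example 1/16 and 3/79 for M). The sketch below follows that reading; the function name and the toy PSSM rows are assumptions for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def split_composition(pssm_window):
    """pssm_window: list of rows, each a dict amino acid -> substitution score.
    Returns (pos_frac, neg_frac): each amino acid's share of all cells with
    score > 0 and of all cells with score < 0, respectively."""
    pos = {aa: 0 for aa in AMINO_ACIDS}
    neg = {aa: 0 for aa in AMINO_ACIDS}
    for row in pssm_window:
        for aa, score in row.items():
            if score > 0:
                pos[aa] += 1
            elif score < 0:
                neg[aa] += 1
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    pos_frac = {aa: (c / n_pos if n_pos else 0.0) for aa, c in pos.items()}
    neg_frac = {aa: (c / n_neg if n_neg else 0.0) for aa, c in neg.items()}
    return pos_frac, neg_frac

# Toy two-row window (hypothetical scores).
row1 = {aa: -1 for aa in AMINO_ACIDS}
row1.update({"M": 3, "L": 2})
row2 = {aa: 0 for aa in AMINO_ACIDS}
row2.update({"L": 4, "K": -2})
pos_frac, neg_frac = split_composition([row1, row2])
```

A "% positive charge" feature would then sum the relevant entries of `pos_frac` or `neg_frac`; which residues count as positively charged is not stated on the slide.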

20 TMSEG step 1 - Feature set VI (number of features per group; the factor 2 reflects separate values for PSSM>0 and PSSM<0). Global features: Global amino acid composition 2*20; Protein length (binned) 1. Local features: PSSM score 21*19; Distance to N- and C-terminus 2; Average hydrophobicity (Kyte-Doolittle) 2*1; % hydrophobic 2*1; % charged (positive & negative) 2*2; % polar 2*1. 20
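Adding up the per-group counts listed on this slide gives the length of the input vector per residue. The total is not stated on the slide; it is simply the sum of the listed terms, and the dictionary keys are paraphrases of the slide's labels.

```python
feature_groups = {
    "global amino acid composition": 2 * 20,
    "protein length (binned)": 1,
    "PSSM scores in window": 21 * 19,
    "distance to N- and C-terminus": 2,
    "average hydrophobicity (Kyte-Doolittle)": 2 * 1,
    "% hydrophobic": 2 * 1,
    "% charged (positive & negative)": 2 * 2,
    "% polar": 2 * 1,
}
total_features = sum(feature_groups.values())
print(total_features)  # 452
```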

21 TMSEG step 2: Empirical filter. Smooth scores with median filter (w = 5). Adjust scores to avoid overprediction (soluble: -185, TMH: -60). Assign each residue to the state with the highest score. Remove signal peptides with <4 residues. Remove TMHs with <7 residues. 21
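The filter steps on this slide can be sketched end to end. Several details are assumptions of this example and not taken from the slides: the median window is truncated at the sequence ends, the adjustments are read as offsets added to the smoothed scores, removed segments are relabelled as soluble, and the one-letter state codes `S`, `o`, `T` are arbitrary.

```python
from statistics import median

def median_filter(values, w=5):
    """Median over a centred window, truncated at the ends (assumption)."""
    half = w // 2
    return [median(values[max(0, i - half):i + half + 1])
            for i in range(len(values))]

def runs(labels):
    """Return (start, end_exclusive, label) for each run of equal labels."""
    out, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            out.append((start, i, labels[start]))
            start = i
    return out

def empirical_filter(sig, sol, tmh, w=5, sol_shift=-185, tmh_shift=-60,
                     min_sp=4, min_tmh=7):
    sig = median_filter(sig, w)
    sol = [s + sol_shift for s in median_filter(sol, w)]
    tmh = [t + tmh_shift for t in median_filter(tmh, w)]
    # Pick the state with the highest adjusted score for every residue.
    labels = [max((("S", a), ("o", b), ("T", c)), key=lambda x: x[1])[0]
              for a, b, c in zip(sig, sol, tmh)]
    for start, end, label in runs(labels):
        too_short = ((label == "S" and end - start < min_sp) or
                     (label == "T" and end - start < min_tmh))
        if too_short:
            labels[start:end] = ["o"] * (end - start)  # assumption: soluble
    return "".join(labels)

# Toy scores (hypothetical): a 10-residue signal peptide and a 3-residue TMH blip.
sig = [900] * 10 + [100] * 20
sol = [100] * 10 + [900] * 5 + [200] * 3 + [900] * 12
tmh = [0] * 15 + [800] * 3 + [0] * 12
result = empirical_filter(sig, sol, tmh)
```

In the toy input the three-residue TMH stretch wins the per-residue vote but is shorter than 7 residues, so it is removed again.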

22 TMSEG step 2: Example (the per-residue score values did not survive extraction). SEQ: M G P R A R P A L L L L... Raw scores (SIG, SOL, TMH) → median filter → adjust for overprediction → OUT: S S S S S S S S S S... 22

23 TMSEG overview Step 1 & 2 23

24 TMSEG step 3: Refine TMH prediction I. Neural Network (25 hidden nodes). Input: TMH segments of variable length. Features (number per group; the factor 2 reflects separate values for PSSM>0 and PSSM<0): Amino acid composition 2*20; Average hydrophobicity (Kyte-Doolittle) 2*1; % hydrophobic 2*1; % charged 2*1; Segment length (exact) 1. 24

25 TMSEG step 3: Refine TMH prediction II. Split long TMHs (≥ 35 residues) into two shorter TMHs (≥ 17 residues each). Keep the two TMHs if the average score is higher after the split. Adjust TMH endpoints by up to 3 residues in either direction. 25
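Both refinements on this slide amount to enumerating a small set of candidate segments and keeping the best-scoring one. In the sketch below, `score_fn` is a hypothetical stand-in for the neural network's segment score, the two halves of a split are assumed to be contiguous, and the thresholds 35 and 17 are read as lower bounds; none of these details is confirmed by the slide.

```python
from itertools import product

def split_candidates(start, end, min_part=17, long_threshold=35):
    """List ways to cut a long segment [start, end) into two parts of at
    least min_part residues each (contiguous parts: an assumption)."""
    if end - start < long_threshold:
        return []
    return [((start, cut), (cut, end))
            for cut in range(start + min_part, end - min_part + 1)]

def adjust_endpoints(start, end, score_fn, seq_len, max_shift=3, min_len=1):
    """Shift both endpoints by up to max_shift residues in either direction
    and keep the segment with the highest score. end is exclusive."""
    best = (score_fn(start, end), start, end)
    for ds, de in product(range(-max_shift, max_shift + 1), repeat=2):
        s, e = start + ds, end + de
        if s < 0 or e > seq_len or e - s < min_len:
            continue
        candidate = (score_fn(s, e), s, e)
        if candidate[0] > best[0]:
            best = candidate
    return best[1], best[2]

def toy_score(s, e):
    """Hypothetical score that peaks at the segment [12, 30)."""
    return -(abs(s - 12) + abs(e - 30))

refined = adjust_endpoints(10, 31, toy_score, seq_len=100)
print(refined)  # (12, 30)
```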

26 TMSEG overview Step

27 TMSEG step 4: Topology prediction I. Random Forest (T = 100, m = 7). Assign soluble segments to side 1 or 2. Features (number per group, with separate values for PSSM>0 and PSSM<0): Amino acid composition 2*2*20; % positive charge 2*2*1; % abs. difference of pos. charge side1/side2 2*1. 27

28 TMSEG step 4: Topology prediction II. Consider only residues close to TMHs: 15 residues next to TMHs and 8 residues into TMHs. Predict topology of the N-terminus and extrapolate. If SP predicted → residues after SP are outside. 28
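Extrapolating from the N-terminus means that the side flips at every transmembrane helix. A minimal sketch, with a hypothetical function name and string labels chosen for the example:

```python
def extrapolate_topology(n_tmh, n_term_side, has_signal_peptide=False):
    """Return the side of each soluble segment from N- to C-terminus for a
    protein with n_tmh helices (n_tmh + 1 segments). If a signal peptide is
    predicted, the residues following it are placed outside."""
    other = {"inside": "outside", "outside": "inside"}
    first = "outside" if has_signal_peptide else n_term_side
    sides = [first]
    for _ in range(n_tmh):
        sides.append(other[sides[-1]])  # crossing a TMH switches the side
    return sides

print(extrapolate_topology(3, "inside"))
# ['inside', 'outside', 'inside', 'outside']
print(extrapolate_topology(2, "inside", has_signal_peptide=True))
# ['outside', 'inside', 'outside']
```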

29 TMSEG overview Step

30 Performance measures I. Per-residue measures are often misleading → score by TMH segments instead. Whole-protein scores: Q_ok and Q_top. 30

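31 Performance measures II. For each protein i (both values as percentages):

$$r_i = \frac{\#\,\text{correctly predicted TMHs}}{\#\,\text{observed TMHs}}, \qquad p_i = \frac{\#\,\text{correctly predicted TMHs}}{\#\,\text{predicted TMHs}}$$

$$Q_{ok} = \frac{100}{N} \sum_{i=1}^{N} x_i, \qquad x_i = \begin{cases} 1, & \text{if } p_i = r_i = 100\% \\ 0, & \text{else} \end{cases}$$

31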

32 Performance measures III. What is a correctly predicted TMH? Strict criteria: endpoint deviation ≤ 5 residues, overlap at least 50%. 32

33 Performance measures IV. t_i: 100% if the topology is correct, otherwise 0%.

$$Q_{top} = \frac{100}{N} \sum_{i=1}^{N} y_i, \qquad y_i = \begin{cases} 1, & \text{if } t_i = p_i = r_i = 100\% \\ 0, & \text{else} \end{cases}$$

33
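The segment-based scores from slides 31 to 33 can be computed as sketched below. Two points are assumptions of this example: the 50% overlap is measured relative to the longer of the two helices, and each observed helix is paired greedily with at most one predicted helix. Helices are given as inclusive (start, end) residue positions.

```python
def helix_matches(obs, pred, max_dev=5, min_overlap=0.5):
    """True if both endpoints deviate by at most max_dev residues and the
    overlap covers at least min_overlap of the longer helix (assumption)."""
    (o_start, o_end), (p_start, p_end) = obs, pred
    overlap = min(o_end, p_end) - max(o_start, p_start) + 1
    longer = max(o_end - o_start + 1, p_end - p_start + 1)
    return (abs(o_start - p_start) <= max_dev and abs(o_end - p_end) <= max_dev
            and overlap >= min_overlap * longer)

def protein_scores(observed, predicted):
    """Return (r_i, p_i) in percent for one protein."""
    unused = list(predicted)
    correct = 0
    for obs in observed:
        hit = next((p for p in unused if helix_matches(obs, p)), None)
        if hit is not None:
            unused.remove(hit)
            correct += 1
    r = 100.0 * correct / len(observed) if observed else 0.0
    p = 100.0 * correct / len(predicted) if predicted else 0.0
    return r, p

def q_ok(proteins):
    x = [1 if protein_scores(o, p) == (100.0, 100.0) else 0 for o, p in proteins]
    return 100.0 * sum(x) / len(proteins)

def q_top(proteins, topology_correct):
    y = [1 if t and protein_scores(o, p) == (100.0, 100.0) else 0
         for (o, p), t in zip(proteins, topology_correct)]
    return 100.0 * sum(y) / len(proteins)

# Toy data (hypothetical): (observed helices, predicted helices) per protein.
proteins = [
    ([(10, 30), (50, 70)], [(12, 31), (48, 69)]),  # all helices correct
    ([(10, 30)], [(10, 30), (60, 80)]),            # one extra prediction
    ([(5, 25)], [(15, 35)]),                       # endpoints off by 10
    ([(100, 120)], [(101, 119)]),                  # correct
]
topology_correct = [True, True, False, False]
print(q_ok(proteins))                     # 50.0
print(q_top(proteins, topology_correct))  # 25.0
```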

34 Performance of TMH predictions 34

35 Performance measures: TMP classification.

$$\text{FPR} = 100 \cdot \frac{\#\,\text{incorrectly predicted TMPs}}{\#\,\text{soluble proteins}}, \qquad \text{Sensitivity} = 100 \cdot \frac{\#\,\text{correctly predicted TMPs}}{\#\,\text{observed TMPs}}$$

Compare to a simple predictor ("Baseline") that uses only a hydrophobicity scale and the positive-inside rule. 35
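Both rates follow directly from per-protein labels. A short sketch with hypothetical toy labels:

```python
def tmp_classification_rates(observed_is_tmp, predicted_is_tmp):
    """Return (sensitivity, FPR) in percent from two parallel boolean lists."""
    pairs = list(zip(observed_is_tmp, predicted_is_tmp))
    n_tmp = sum(1 for o, _ in pairs if o)
    n_soluble = len(pairs) - n_tmp
    true_pos = sum(1 for o, p in pairs if o and p)       # TMPs found
    false_pos = sum(1 for o, p in pairs if not o and p)  # soluble called TMP
    return 100.0 * true_pos / n_tmp, 100.0 * false_pos / n_soluble

# Toy labels: 4 membrane proteins followed by 10 soluble proteins.
observed = [True] * 4 + [False] * 10
predicted = [True, True, True, False] + [True] + [False] * 9
sensitivity, fpr = tmp_classification_rates(observed, predicted)
print(sensitivity, fpr)  # 75.0 10.0
```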

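36 TMP classification: very low misclassification rates. Cells marked [missing] did not survive extraction.

Method | TMP sensitivity (%) | TMP FPR (%) | Topology correct (%) | Misclassified in human | More mistakes than TMSEG in human
TMSEG | 98 ± 2 | 3 ± 1 | 93 ± [missing] | [missing] | [missing]
PolyPhobius | 100 ± 0 | 5 ± 1 | 78 ± [missing] | [missing] | [missing]
MEMSAT3 | 100 ± 0 | 28 ± 2 | 93 ± 4 | 4,313 | 3,755
MEMSAT-SVM | 98 ± 2 | 14 ± 2 | 88 ± 5 | 2,253 | 1,695
Baseline | 95 ± 3 | 31 ± 2 | 75 ± 7 | 5,015 | 4,457

36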

37 Dataset of 12 new proteins. How to get more data? Use what was published since starting work → data unknown to any method. From 07/2013 to 02/2016: only 12 new TMPs published. Very small dataset. TMSEG predicts every TMH of the 10 recognized TMPs. 37

38 Applying TMSEG to other methods I. High modularity (steps 1-4). Apply steps 3 and 4 to other methods. Step 3: NN-based TMH prediction improvement. Step 4: RF-based topology prediction. Can this improve other methods? 38

39 Applying TMSEG to other methods II 39

40 Potential extensions. Re-entrant regions are not modelled (little data). Idea: check abnormal TMH segments for re-entrant regions. Do the scores increase if the topology is not switched? 40

41 Availability Debian package: Github: github.com/rostlab/tmseg PredictProtein: predictprotein.org Yachdav et al., 2014, NAR 41

42 Thank you 42

43 References

Yachdav, G., Kloppmann, E., Kajan, L., Hecht, M., Goldberg, T., Hamp, T., Rost, B. (2014). PredictProtein-an open resource for online prediction of protein structural and functional features. Nucleic Acids Research, 42(Web Server issue), W

Jensen, L. J., & Bateman, A. (2011). The rise and fall of supervised machine learning techniques. Bioinformatics, 27(24),

Loh, W.-Y. (2011). Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1),

Lomize, M. A., Lomize, A. L., Pogozheva, I. D., & Mosberg, H. I. (2006). OPM: orientations of proteins in membranes database. Bioinformatics (Oxford, England), 22(5),

Velankar, S., McNeil, P., Mittard-Runte, V., Suarez, a, Barrell, D., Apweiler, R., & Henrick, K. (2005). E-MSD: an integrated data resource for bioinformatics. Nucleic Acids Research, 33(Database issue), D

Kozma, D., Simon, I., & Tusnády, G. E. (2013). PDBTM: Protein Data Bank of transmembrane proteins after 8 years. Nucleic Acids Research, 41(Database issue), D

Mika, S., & Rost, B. (2003). UniqueProt: creating representative protein sequence sets. Nucleic Acids Research, 31(13),

proteins TMSEG: Novel prediction of transmembrane helices Michael Bernhofer, 1 * Edda Kloppmann, 1,2 Jonas Reeb, 1 and Burkhard Rost 1,2,3,4