TMSEG Michael Bernhofer, Jonas Reeb pp1_tmseg

Lecture: Protein Prediction 1 (for Computational Biology), Protein structure. TUM, summer semester, 09.06.2016

Last time

Yet another transmembrane predictor?
- More data available
- Re-training old methods is viable but no one does it
- Less extensive machine learning
- Runtime

Dataset: Transmembrane helices I
- 166 membrane protein sequences (TMP166)
- TMH assignment from 3D structure by OPM & PDBTM
- Assignments differ; both are used for training
- Map to UniProt sequence using SIFTS
- Redundancy reduction with UniqueProt at HVAL>0
Lomize et al., 2006, Bioinformatics; Kozma et al., 2013, NAR; Velankar et al., 2013, NAR; Mika et al., 2003, NAR

Dataset: Transmembrane helices II
- Inside/outside topology assignment (OPM)
Lomize et al., 2006, Bioinformatics

Dataset: Proteins with and without signal peptides (SP1441)
- Derived from the SignalP 4.0 training set
- Redundancy reduced against the set of 166 TMPs at HVAL>0
- Redundancy reduced within the set at HVAL>0
- Soluble: 1142 (452 with SP)
- Membrane: 299 (25 with SP)

Dataset: Split
- Split into 4 subsets, maintaining the distribution of TMPs, SPs and sequence lengths
- Use 3 sets for cross-validation, keep one for the final independent evaluation (blind set)
[Figure: Train/Blind split; blind set sizes labelled TMP166: 41, SP1441: 285]

Classification trees
- Given N training samples and M input features, find the best recursive partitioning to predict the class labels in the leaf nodes
- The approaches to splitting, pruning, balancing, ... differentiate the algorithms

Classification trees: example
Loh, 2011, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery

Random forests
- Ensemble method: grow T trees for a forest
- For M input features, choose m < M
- For each tree t = 1..T:
  - Select N training samples with replacement from all N samples
  - At every split, choose m random features; use the best split among those to build the tree
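
The two sources of randomness named on this slide (bootstrap sampling of training rows and a random feature subset at each split) can be sketched with the standard library alone. This is an illustrative sketch, not the implementation used in the lecture; the function names `bootstrap_indices` and `random_feature_subset` and the toy sizes are assumptions.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def random_feature_subset(n_features, m, rng):
    """Pick m distinct feature indices to consider at one split (m < M)."""
    if not 0 < m < n_features:
        raise ValueError("m must satisfy 0 < m < M")
    return sorted(rng.sample(range(n_features), m))

rng = random.Random(0)
N, M, m, T = 10, 20, 4, 3  # toy sizes, not the values used by TMSEG
for t in range(T):
    rows = bootstrap_indices(N, rng)          # training rows for tree t
    feats = random_feature_subset(M, m, rng)  # candidates at one split of tree t
    print(t, rows, feats)
```

Because rows are drawn with replacement, some samples appear several times in a tree's training set and others not at all, which is what makes the individual trees differ.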

Random forests: Popularity
- Fast
- No black box
- Intuitive to interpret
- Good performance
Jensen et al., 2011, Bioinformatics

TMSEG step 1: Initial prediction
- Random Forest (T = 100, m = 9)
- Sliding window of 19 residues (w = 19)
- 3 scores for each residue (0-1000): signal peptide, transmembrane helix, soluble
- Scores scaled from 0.0..1.0 to 0..1000
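
A minimal sketch of the two mechanical parts of this step: cutting a residue-centred window out of the sequence and rescaling a forest output from 0.0..1.0 to 0..1000. The padding character and the names `sliding_windows` and `scale_score` are assumptions made for illustration; the slides do not say how windows at the termini are filled.

```python
def sliding_windows(sequence, w=19, pad="-"):
    """Return one window of length w centred on each residue.

    Positions beyond the termini are filled with `pad` (an assumption;
    the slides do not specify the handling of the termini).
    """
    if w % 2 == 0:
        raise ValueError("window size must be odd")
    half = w // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + w] for i in range(len(sequence))]

def scale_score(probability):
    """Map a probability in 0.0..1.0 to an integer score in 0..1000."""
    return round(probability * 1000)

# Small window so the result is easy to inspect; TMSEG uses w = 19.
windows = sliding_windows("MGPRARPALLLL", w=5)
print(windows[0], windows[-1])
print(scale_score(0.4))
```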

TMSEG overview: Step 1

TMSEG step 1: Feature set I
Global features:
- Global amino acid composition
- Protein length
Local features:
- PSSM score
- Distance to N- and C-terminus
- Window-based (w = 9): average hydrophobicity (Kyte-Doolittle), % hydrophobic, % charged (positive & negative), % polar

TMSEG step 1: Feature set II
Adjusting for conservation
- Substitutions with score > 0: 16
- Substitutions with score < 0: 79

TMSEG step 1: Feature set III
Adjusting for conservation
- Amino acid composition, M (PSSM>0) = 1/16

TMSEG step 1: Feature set IV
Adjusting for conservation
- Amino acid composition, M (PSSM>0) = 1/16
- Amino acid composition, M (PSSM<0) = 3/79

TMSEG step 1: Feature set V
Adjusting for conservation
- % positive charge (PSSM>0) = 2/16
- % positive charge (PSSM<0) = 8/79
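
One way to read the fractions on these slides (for example 1/16 for methionine) is: count all PSSM entries in the window with a positive score, then count how many of them lie in the column of the amino acid of interest. The sketch below implements that reading. It is an interpretation, since the accompanying figure is not part of the text; the column order, the toy window and the names `adjusted_composition` and `_row` are assumptions.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # assumed column order of the PSSM

def adjusted_composition(pssm_window, amino_acid, positive=True):
    """Count conservation-adjusted occurrences of one amino acid.

    pssm_window: list of rows (one per residue), each a list of 20 integer
    scores ordered as in AMINO_ACIDS.
    Returns (hits, total): total is the number of entries with score > 0
    (or < 0 if positive=False), hits is how many of those are in the
    column of `amino_acid`. The feature value would be hits / total.
    """
    col = AMINO_ACIDS.index(amino_acid)
    keep = (lambda s: s > 0) if positive else (lambda s: s < 0)
    total = sum(1 for row in pssm_window for s in row if keep(s))
    hits = sum(1 for row in pssm_window if keep(row[col]))
    return hits, total

def _row(default, **scores):
    """Build one toy PSSM row: `default` everywhere except the given residues."""
    return [scores.get(aa, default) for aa in AMINO_ACIDS]

# Toy two-residue window (made-up scores, not taken from the lecture).
toy_window = [_row(-1, M=5, L=2), _row(0, A=3, M=-2)]
print(adjusted_composition(toy_window, "M", positive=True))
print(adjusted_composition(toy_window, "M", positive=False))
```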

TMSEG step 1: Feature set VI
Number of input values per feature (a factor 2 marks features computed once for PSSM>0 and once for PSSM<0):
Global features:
- Global amino acid composition: 2*20
- Protein length (binned): 1
Local features:
- PSSM score: 21*19
- Distance to N- and C-terminus: 2
- Average hydrophobicity (Kyte-Doolittle): 2*1
- % hydrophobic: 2*1
- % charged (positive & negative): 2*2
- % polar: 2*1

TMSEG step 2: Empirical filter
- Smooth scores with a median filter (w = 5)
- Adjust scores to avoid overprediction: soluble -185, TMH -60
- Assign each residue to the state with the highest score
- Remove signal peptides with <4 residues
- Remove TMHs with <7 residues

TMSEG step 2: Example
SEQ:   M   G   P   R   A   R   P   A   L   L   L   L ...
SIG: 400 400 100 100 800 600 700 900 100 600 100 800 ...
SOL: 500 400 600 500 100 100 100 000 500 100 100 200 ...
TMH: 100 200 300 400 100 300 200 100 400 300 800 000 ...
→ Median filter
SIG: 400 400 400 400 600 700 700 600 600 600 ...
SOL: 500 500 500 400 100 100 100 100 100 100 ...
TMH: 100 200 200 300 300 200 200 300 300 300 ...
→ Adjust for overprediction
SIG: 400 400 400 400 600 700 700 600 600 600 ...
SOL: 315 315 315 215 -85 -85 -85 -85 -85 -85 ...
TMH: 040 140 140 240 240 140 140 240 240 240 ...
OUT:   S   S   S   S   S   S   S   S   S   S ...
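
The example can be reproduced with a short script: median-smooth each score track with w = 5, subtract the offsets for the soluble and TMH tracks, and pick the highest-scoring state per residue. This is a sketch, not the TMSEG code. The edge handling (repeating the first and last value) is an assumption chosen because it reproduces the numbers on the slide, and the removal of short segments is left out. The names `median_filter` and `assign_states` are hypothetical.

```python
from statistics import median

# Offsets from the slide: soluble -185, TMH -60, signal peptide unchanged.
OFFSETS = {"SIG": 0, "SOL": -185, "TMH": -60}

def median_filter(scores, w=5):
    """Median-smooth one score track.

    Edges are padded by repeating the first/last value (an assumption that
    matches the slide example; the slides do not state the edge handling).
    """
    half = w // 2
    padded = [scores[0]] * half + list(scores) + [scores[-1]] * half
    return [int(median(padded[i:i + w])) for i in range(len(scores))]

def assign_states(raw, w=5):
    """Smooth, apply offsets, and assign each residue its best-scoring state."""
    smoothed = {k: median_filter(v, w) for k, v in raw.items()}
    adjusted = {k: [s + OFFSETS[k] for s in v] for k, v in smoothed.items()}
    n = len(next(iter(adjusted.values())))
    states = [max(adjusted, key=lambda k: adjusted[k][i]) for i in range(n)]
    return smoothed, adjusted, states

# Raw scores for the twelve residues shown on the slide.
RAW = {
    "SIG": [400, 400, 100, 100, 800, 600, 700, 900, 100, 600, 100, 800],
    "SOL": [500, 400, 600, 500, 100, 100, 100, 0, 500, 100, 100, 200],
    "TMH": [100, 200, 300, 400, 100, 300, 200, 100, 400, 300, 800, 0],
}
smoothed, adjusted, states = assign_states(RAW)
for name in ("SIG", "SOL", "TMH"):
    print(name, smoothed[name][:10], adjusted[name][:10])
print("OUT", states[:10])
```

Only the first ten positions are compared with the slide, because the sequence continues beyond the twelve residues shown and the last two windows would need the residues that follow.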

TMSEG overview: Steps 1 & 2

TMSEG step 3: Refine TMH prediction I
- Neural network (25 hidden nodes)
- Input: TMH segments of variable length
Features (a factor 2 marks features computed once for PSSM>0 and once for PSSM<0):
- Amino acid composition: 2*20
- Average hydrophobicity (Kyte-Doolittle): 2*1
- % hydrophobic: 2*1
- % charged: 2*1
- Segment length (exact): 1

TMSEG step 3: Refine TMH prediction II
- Split long TMHs (≥35 residues) into two shorter TMHs (≥17 residues)
- Keep the two TMHs if the average score is higher after the split
- Adjust TMH endpoints by up to 3 residues in either direction

TMSEG overview: Steps 1-3

TMSEG step 4: Topology prediction I
- Random Forest (T = 100, m = 7)
- Assign soluble segments to side 1 or side 2
Features (a factor 2 marks features computed once for PSSM>0 and once for PSSM<0):
- Amino acid composition: 2*2*20
- % positive charge: 2*2*1
- % absolute difference of positive charge, side 1 / side 2: 2*1

TMSEG step 4: Topology prediction II
- Consider only residues close to TMHs: 15 residues next to TMHs and 8 residues into TMHs
- Predict the topology of the N-terminus and extrapolate
- If an SP is predicted → residues after the SP are outside
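
The extrapolation step can be illustrated as follows: once the side of the N-terminal segment is known, every TMH flips the side, so the remaining segments follow directly. This is a minimal sketch under the assumption that every predicted TMH fully crosses the membrane (no re-entrant regions); the function name `extrapolate_topology` and its signature are hypothetical.

```python
def extrapolate_topology(n_tmh, n_term_side, has_signal_peptide=False):
    """Return the side of each non-membrane segment, N- to C-terminus.

    n_tmh: number of predicted transmembrane helices.
    n_term_side: predicted side of the N-terminus ("inside" or "outside").
    has_signal_peptide: if True, the residues after the signal peptide are
    placed outside, overriding n_term_side (rule from the slide).
    Assumes every TMH crosses the membrane once, so the side flips per helix.
    """
    other = {"inside": "outside", "outside": "inside"}
    if n_term_side not in other:
        raise ValueError("n_term_side must be 'inside' or 'outside'")
    side = "outside" if has_signal_peptide else n_term_side
    sides = [side]
    for _ in range(n_tmh):
        side = other[side]
        sides.append(side)
    return sides

print(extrapolate_topology(3, "inside"))
print(extrapolate_topology(2, "inside", has_signal_peptide=True))
```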

TMSEG overview: Steps 1-4

Performance measures I
- Per-residue measures are often misleading → score by TMH segments instead
- Whole-protein scores: Q_ok and Q_top

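Performance measures II
Per-protein recall r_i and precision p_i, expressed as percentages:
$$r_i = \frac{\#\,\text{correctly predicted TMHs}}{\#\,\text{observed TMHs}}, \qquad p_i = \frac{\#\,\text{correctly predicted TMHs}}{\#\,\text{predicted TMHs}}$$
$$Q_{ok} = \frac{100}{N} \sum_{i=1}^{N} x_i, \qquad x_i = \begin{cases} 1, & \text{if } p_i = r_i = 100\% \\ 0, & \text{else} \end{cases}$$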

Performance measures III
What is a correctly predicted TMH? Strict criteria:
- Endpoint deviation ≤5 residues
- Overlap at least 50%
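
The two criteria can be checked per helix pair as sketched below. The slide does not say what the 50% overlap is measured against; this sketch assumes it is relative to the longer of the two helices. Coordinates are taken as inclusive residue positions, and the name `is_correct_tmh` is hypothetical.

```python
def is_correct_tmh(observed, predicted, max_shift=5, min_overlap=0.5):
    """Check one predicted TMH against one observed TMH.

    observed, predicted: (start, end) residue positions, inclusive.
    A prediction counts as correct if both endpoints deviate by at most
    max_shift residues and the overlap covers at least min_overlap of the
    longer helix (the reference length is an assumption).
    """
    o_start, o_end = observed
    p_start, p_end = predicted
    endpoints_ok = (abs(o_start - p_start) <= max_shift
                    and abs(o_end - p_end) <= max_shift)
    overlap = max(0, min(o_end, p_end) - max(o_start, p_start) + 1)
    longer = max(o_end - o_start + 1, p_end - p_start + 1)
    return endpoints_ok and overlap / longer >= min_overlap

print(is_correct_tmh((10, 30), (12, 33)))  # small shifts, large overlap
print(is_correct_tmh((10, 30), (17, 30)))  # start is off by 7 residues
```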

Performance measures IV
t_i: 100% if the topology is correct, otherwise 0%
$$Q_{top} = \frac{100}{N} \sum_{i=1}^{N} y_i, \qquad y_i = \begin{cases} 1, & \text{if } t_i = p_i = r_i = 100\% \\ 0, & \text{else} \end{cases}$$
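
Both whole-protein scores can be computed from per-protein counts, as in the sketch below. A protein has p_i = r_i = 100% exactly when the number of correctly predicted TMHs equals both the number of observed and the number of predicted TMHs. The dictionary keys, the function name `q_ok_q_top` and the toy data are assumptions for illustration.

```python
def q_ok_q_top(proteins):
    """Compute (Q_ok, Q_top) in percent.

    proteins: list of dicts with the keys
      n_observed, n_predicted, n_correct (TMH counts, n_observed > 0)
      topology_correct (bool)
    """
    n = len(proteins)
    # x_i = 1 if precision and recall are both 100%
    ok = [p["n_correct"] == p["n_observed"] == p["n_predicted"] for p in proteins]
    # y_i = 1 if, in addition, the topology is correct
    top = [x and p["topology_correct"] for x, p in zip(ok, proteins)]
    return 100 * sum(ok) / n, 100 * sum(top) / n

toy = [  # made-up counts, not results from the lecture
    {"n_observed": 7, "n_predicted": 7, "n_correct": 7, "topology_correct": True},
    {"n_observed": 7, "n_predicted": 7, "n_correct": 7, "topology_correct": False},
    {"n_observed": 7, "n_predicted": 8, "n_correct": 7, "topology_correct": True},
    {"n_observed": 2, "n_predicted": 2, "n_correct": 1, "topology_correct": True},
]
print(q_ok_q_top(toy))
```

Because a single missed or extra helix sets x_i to 0, Q_ok is much stricter than a per-residue accuracy, and Q_top can never exceed Q_ok.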

Performance of TMH predictions

Performance measures: TMP classification
$$\text{FPR} = 100 \cdot \frac{\#\,\text{incorrectly predicted TMPs}}{\#\,\text{soluble proteins}}, \qquad \text{Sensitivity} = 100 \cdot \frac{\#\,\text{correctly predicted TMPs}}{\#\,\text{observed TMPs}}$$
- Compare to a simple predictor ("Baseline")
- The baseline uses only a hydrophobicity scale and the positive-inside rule
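
As a small worked illustration of the two rates, the sketch below computes sensitivity and FPR from parallel lists of observed and predicted labels. The function name `tmp_classification_rates` and the toy labels are assumptions, not data from the lecture.

```python
def tmp_classification_rates(observed, predicted):
    """Return (sensitivity, FPR) in percent.

    observed, predicted: parallel lists of booleans, True = membrane
    protein (TMP), False = soluble protein.
    """
    pairs = list(zip(observed, predicted))
    n_tmp = sum(1 for o, _ in pairs if o)
    n_soluble = len(pairs) - n_tmp
    true_pos = sum(1 for o, p in pairs if o and p)       # TMPs found
    false_pos = sum(1 for o, p in pairs if not o and p)  # soluble called TMP
    sensitivity = 100 * true_pos / n_tmp
    fpr = 100 * false_pos / n_soluble
    return sensitivity, fpr

# Toy labels: 4 TMPs followed by 5 soluble proteins.
toy_observed = [True] * 4 + [False] * 5
toy_predicted = [True, True, True, False, True, False, False, False, False]
print(tmp_classification_rates(toy_observed, toy_predicted))
```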

TMP classification: Very low misclassification rates

Method        TMP sensitivity   TMP FPR   Topology correct   Misclassified in human   More mistakes than TMSEG in human
TMSEG         98 ± 2            3 ± 1     93 ± 4             558                      -
PolyPhobius   100 ± 0           5 ± 1     78 ± 7             770                      212
MEMSAT3       100 ± 0           28 ± 2    93 ± 4             4,313                    3,755
MEMSAT-SVM    98 ± 2            14 ± 2    88 ± 5             2,253                    1,695
Baseline      95 ± 3            31 ± 2    75 ± 7             5,015                    4,457

Dataset of 12 new proteins
- How to get more data? Use what was published since starting work → data unknown to any method
- From 07/2013 to 02/2016: only 12 new TMPs published
- Very small dataset
- TMSEG predicts every TMH of the 10 recognized TMPs

Applying TMSEG to other methods I
- High modularity (steps 1-4)
- Apply steps 3 and 4 to other methods
  - Step 3: NN-based TMH prediction improvement
  - Step 4: RF-based topology prediction
- Can this improve other methods?

Applying TMSEG to other methods II

Potential extensions
- Re-entrant regions are not modelled (little data)
- Idea: check abnormal TMH segments for re-entrant regions
- Does not switching topology increase scores?

Availability
- Debian package: http://rostlab.org/debian/pool/main/t/tmseg/
- GitHub: github.com/rostlab/tmseg
- PredictProtein: predictprotein.org
Yachdav et al., 2014, NAR

Thank you

References
Yachdav, G., Kloppmann, E., Kajan, L., Hecht, M., Goldberg, T., Hamp, T., Rost, B. (2014). PredictProtein-an open resource for online prediction of protein structural and functional features. Nucleic Acids Research, 42(Web Server issue), W337-43. http://doi.org/10.1093/nar/gku366
Jensen, L. J., & Bateman, A. (2011). The rise and fall of supervised machine learning techniques. Bioinformatics, 27(24), 3331-3332. http://doi.org/10.1093/bioinformatics/btr585
Loh, W.-Y. (2011). Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1), 14-23. http://doi.org/10.1002/widm.8
Lomize, M. A., Lomize, A. L., Pogozheva, I. D., & Mosberg, H. I. (2006). OPM: orientations of proteins in membranes database. Bioinformatics (Oxford, England), 22(5), 623-5. http://doi.org/10.1093/bioinformatics/btk023
Velankar, S., McNeil, P., Mittard-Runte, V., Suarez, A., Barrell, D., Apweiler, R., & Henrick, K. (2005). E-MSD: an integrated data resource for bioinformatics. Nucleic Acids Research, 33(Database issue), D262-5. http://doi.org/10.1093/nar/gki058
Kozma, D., Simon, I., & Tusnády, G. E. (2013). PDBTM: Protein Data Bank of transmembrane proteins after 8 years. Nucleic Acids Research, 41(Database issue), D524-9. http://doi.org/10.1093/nar/gks1169
Mika, S., & Rost, B. (2003). UniqueProt: creating representative protein sequence sets. Nucleic Acids Research, 31(13), 3789-3791. http://doi.org/10.1093/nar/gkg620