TMSEG Michael Bernhofer, Jonas Reeb pp1_tmseg


Title: TMSEG. Michael Bernhofer, Jonas Reeb (short title: pp1_tmseg). Lecture: Protein Prediction 1 (for Computational Biology) - Protein structure, TUM, summer semester, 09.06.2016. 1

Last time 2

3

Yet another transmembrane predictor? More data available. Re-training old methods is viable, but no one does it. Less extensive machine learning. Runtime. 4

Dataset - Transmembrane helices I: 166 membrane protein sequences (TMP166). TMH assignment from 3D structure by OPM & PDBTM; the assignments differ, so both were used for training. Mapped to UniProt sequences using SIFTS. Redundancy reduction with UniqueProt at HVAL > 0. (Lomize et al., 2006, Bioinformatics; Kozma et al., 2013, NAR; Velankar et al., 2013, NAR; Mika et al., 2003, NAR) 5

Dataset - Transmembrane helices II: Inside/outside topology assignment from OPM (Lomize et al., 2006, Bioinformatics). 6

Dataset - Proteins with and without signal peptides (SP1441): Derived from the SignalP 4.0 training set. Redundancy reduced against the set of 166 TMPs at HVAL > 0, and within the set at HVAL > 0. Soluble: 1142 (452 with SP); membrane: 299 (25 with SP). 7

Dataset split: Split into 4 subsets, maintaining the distribution of TMPs, SPs and sequence lengths. Use 3 sets for cross-validation and keep one for the final independent evaluation (blind set): 41 proteins of TMP166 and 285 proteins of SP1441 are held out as blind, the remainder is used for training. 8
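To make the split concrete, here is a minimal sketch of how such a stratified four-way split could be produced. The keys is_tmp, has_sp and sequence and the 100-residue length bins are illustrative assumptions, not taken from the slides.

```python
import random
from collections import defaultdict

def stratified_split(proteins, n_subsets=4, seed=42):
    """Split proteins into n_subsets while roughly preserving the joint
    distribution of TMP/soluble, signal peptide, and sequence length."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for p in proteins:
        # Stratify by TMP/soluble, presence of a signal peptide,
        # and a coarse length bin (here: 100-residue bins, an assumption).
        key = (p["is_tmp"], p["has_sp"], len(p["sequence"]) // 100)
        buckets[key].append(p)

    subsets = [[] for _ in range(n_subsets)]
    for members in buckets.values():
        rng.shuffle(members)
        # Deal members round-robin so every subset gets a similar
        # share of each stratum.
        for i, p in enumerate(members):
            subsets[i % n_subsets].append(p)
    return subsets

# Usage: three subsets for cross-validation, one kept as blind set.
# subsets = stratified_split(all_proteins)
# blind, cv_sets = subsets[0], subsets[1:]
```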

Classification trees: Given N training samples and M input features, find the best recursive partitioning to predict the class labels in the leaf nodes. Splitting, pruning, balancing... these approaches differentiate the algorithms. 9
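As a sketch of what "find the best recursive partitioning" means at a single node, the snippet below scores every (feature, threshold) split by the Gini impurity of the resulting children. The exhaustive search and the choice of Gini are illustrative, not a statement about any particular tree algorithm.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(X, y):
    """Search all (feature, threshold) pairs and return the split with
    the lowest weighted Gini impurity of the two children."""
    n, m = len(X), len(X[0])
    best = (None, None, float("inf"))
    for j in range(m):
        for t in sorted({row[j] for row in X}):
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best  # (feature index, threshold, impurity)
```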

Classification trees example Loh, 2011, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10

Random forests: Ensemble method, grow T trees for a forest. For M input features, choose m < M. For each tree t = 1..T: select N training samples with replacement from all N samples; at every split, choose m random features and use the best split among those for building the tree. 11
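A minimal sketch of that recipe, assuming X and y are NumPy arrays and using scikit-learn's DecisionTreeClassifier as the base learner; TMSEG's own implementation is not shown on the slides, so this is purely illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_forest(X, y, T=100, m=9, seed=0):
    """Grow a random forest as described on the slide: for each of the
    T trees, draw N samples with replacement, and let the tree consider
    only m randomly chosen features at every split."""
    rng = np.random.default_rng(seed)
    n = len(X)
    forest = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)               # bootstrap sample
        tree = DecisionTreeClassifier(
            max_features=m,                            # m random features per split
            random_state=int(rng.integers(1 << 30)),
        )
        tree.fit(X[idx], y[idx])
        forest.append(tree)
    return forest

def forest_predict(forest, x):
    """Majority vote over all trees."""
    votes = [tree.predict([x])[0] for tree in forest]
    return max(set(votes), key=votes.count)
```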

Random forests - Popularity Fast No black box Intuitive to interpret Good performance Jensen et al., 2011, Bioinformatics 12

TMSEG step 1 - Initial prediction: Random forest (T = 100, m = 9), sliding window of 19 residues (w = 19). 3 scores for each residue (0-1000): signal peptide, transmembrane helix, soluble. Scores scaled from 0.0..1.0 to 0..1000. 13
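A sketch of how such a per-residue, sliding-window prediction could look. It assumes a trained classifier with an sklearn-style predict_proba, a hypothetical featurize_window helper, and a fixed class order (signal peptide, TMH, soluble); none of these details are specified on the slide.

```python
def predict_residue_scores(sequence, pssm, forest, featurize_window, w=19):
    """Slide a window of w residues over the protein, let the random
    forest output class probabilities for the central residue, and
    rescale the probabilities from 0.0..1.0 to integer scores 0..1000.
    Classes are assumed to be ordered (signal peptide, TMH, soluble)."""
    half = w // 2
    scores = []
    for i in range(len(sequence)):
        lo, hi = max(0, i - half), min(len(sequence), i + half + 1)
        x = featurize_window(sequence, pssm, lo, hi)   # hypothetical helper
        p_sig, p_tmh, p_sol = forest.predict_proba([x])[0]
        scores.append({
            "SIG": int(round(1000 * p_sig)),
            "TMH": int(round(1000 * p_tmh)),
            "SOL": int(round(1000 * p_sol)),
        })
    return scores
```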

TMSEG overview Step 1 14

TMSEG step 1 - Feature set I: Global features: global amino acid composition, protein length. Local features (window w = 9): PSSM score, distance to N- and C-terminus, average hydrophobicity (Kyte-Doolittle), % hydrophobic, % charged (positive & negative), % polar. 15

TMSEG step 1 - Feature set II: Adjusting for conservation. Substitutions with score > 0: 16; substitutions with score < 0: 79. 16

TMSEG step 1 - Feature set III Adjusting for conservation Amino acid composition M (PSSM>0) = 1/16 17

TMSEG step 1 - Feature set IV Adjusting for conservation Amino acid composition M (PSSM>0) = 1/16 Amino acid composition M (PSSM<0) = 3/79 18

TMSEG step 1 - Feature set V Adjusting for conservation % positive charge (PSSM>0) = 2/16 % positive charge (PSSM<0) = 8/79 19

TMSEG step 1 - Feature set VI: Global features: global amino acid composition (PSSM>0 / PSSM<0) 2*20, protein length (binned) 1. Local features: PSSM score 21*19, distance to N- and C-terminus 2, average hydrophobicity (Kyte-Doolittle; PSSM>0 / PSSM<0) 2*1, % hydrophobic 2*1, % charged (positive & negative) 2*2, % polar 2*1. 20
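The conservation adjustment from Feature sets II-V can be read as: compute the amino-acid composition separately over all PSSM entries with positive and with negative substitution scores. A sketch under the assumption that a window is given as one score dict per position (the representation is illustrative):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def conservation_adjusted_composition(pssm_window):
    """Amino-acid composition computed separately over PSSM entries with
    positive and with negative substitution scores, as in the slide's
    worked example (e.g. composition of M over the PSSM>0 entries = 1/16).

    pssm_window: list of dicts, one per residue position, mapping each
    amino acid to its substitution score (an assumed representation)."""
    pos_counts = {aa: 0 for aa in AMINO_ACIDS}
    neg_counts = {aa: 0 for aa in AMINO_ACIDS}
    n_pos = n_neg = 0
    for row in pssm_window:
        for aa, score in row.items():
            if score > 0:
                pos_counts[aa] += 1
                n_pos += 1
            elif score < 0:
                neg_counts[aa] += 1
                n_neg += 1
    comp_pos = {aa: pos_counts[aa] / n_pos for aa in AMINO_ACIDS} if n_pos else {}
    comp_neg = {aa: neg_counts[aa] / n_neg for aa in AMINO_ACIDS} if n_neg else {}
    return comp_pos, comp_neg
```

The % hydrophobic, % charged and % polar features would then be sums of these compositions over the corresponding residue groups, e.g. K and R for positive charge, which reproduces the 2/16 and 8/79 from Feature set V.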

TMSEG step 2 - Empirical filter: Smooth scores with a median filter (w = 5). Adjust scores to avoid overprediction (soluble: -185, TMH: -60). Assign each residue to the state with the highest score. Remove signal peptides with <4 residues and TMHs with <7 residues. 21

TMSEG step 2 - Example 22
SEQ: M   G   P   R   A   R   P   A   L   L   L   L   ...
SIG: 400 400 100 100 800 600 700 900 100 600 100 800 ...
SOL: 500 400 600 500 100 100 100 000 500 100 100 200 ...
TMH: 100 200 300 400 100 300 200 100 400 300 800 000 ...
→ Median filter
SIG: 400 400 400 400 600 700 700 600 600 600 ...
SOL: 500 500 500 400 100 100 100 100 100 100 ...
TMH: 100 200 200 300 300 200 200 300 300 300 ...
→ Adjust for overprediction
SIG: 400 400 400 400 600 700 700 600 600 600 ...
SOL: 315 315 315 215 -85 -85 -85 -85 -85 -85 ...
TMH: 040 140 140 240 240 140 140 240 240 240 ...
OUT: S   S   S   S   S   S   S   S   S   S   ...
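A sketch of the whole step-2 filter, reproducing the operations from the example above (median smoothing, the -185/-60 adjustment, per-residue state assignment, and the minimum segment lengths). The 'S'/'H'/'.' state labels and the handling of window borders are illustrative assumptions.

```python
from statistics import median

def empirical_filter(scores, sol_penalty=185, tmh_penalty=60, w=5):
    """Median-smooth the per-residue scores, penalize SOL and TMH to
    counter overprediction, pick the highest-scoring state per residue,
    and drop segments that are too short."""
    n = len(scores)
    half = w // 2

    def smooth(track):
        return [median(track[max(0, i - half):min(n, i + half + 1)])
                for i in range(n)]

    sig = smooth([s["SIG"] for s in scores])
    sol = [v - sol_penalty for v in smooth([s["SOL"] for s in scores])]
    tmh = [v - tmh_penalty for v in smooth([s["TMH"] for s in scores])]

    # 'S' = signal peptide, 'H' = TMH, '.' = soluble (labels assumed here).
    states = [max((sig[i], "S"), (sol[i], "."), (tmh[i], "H"))[1]
              for i in range(n)]

    # Remove signal peptides shorter than 4 residues and TMHs shorter
    # than 7 residues by reassigning them to the soluble state.
    min_len = {"S": 4, "H": 7}
    i = 0
    while i < n:
        j = i
        while j < n and states[j] == states[i]:
            j += 1
        if states[i] in min_len and (j - i) < min_len[states[i]]:
            for k in range(i, j):
                states[k] = "."
        i = j
    return states
```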

TMSEG overview Step 1 & 2 23

TMSEG step 3 - Refine TMH prediction I: Neural network (25 hidden nodes). Input: TMH segments of variable length. Features: amino acid composition (PSSM>0 / PSSM<0) 2*20, average hydrophobicity (Kyte-Doolittle) 2*1, % hydrophobic 2*1, % charged 2*1, segment length (exact) 1. 24
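As a stand-in for the refinement network (the slides do not say how it is implemented), a single-hidden-layer feed-forward net with 25 nodes could be set up as follows; the feature extraction that maps variable-length segments to fixed-length vectors is assumed to exist elsewhere.

```python
from sklearn.neural_network import MLPClassifier

# A feed-forward network with one hidden layer of 25 nodes, used here as
# a stand-in for the step-3 refinement network. Each candidate TMH
# segment is first turned into a fixed-length feature vector (amino-acid
# composition, average hydrophobicity, % hydrophobic, % charged, segment
# length), so segments of different lengths all map to the same input size.
refine_net = MLPClassifier(hidden_layer_sizes=(25,), max_iter=2000, random_state=0)
# refine_net.fit(segment_features_train, segment_labels_train)
# p_tmh = refine_net.predict_proba([segment_features(candidate)])[0][1]
```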

TMSEG step 3 - Refine TMH prediction II: Split long TMHs (≥35 residues) into two shorter TMHs (≥17 residues each). Keep the two TMHs if the average score is higher after the split. Adjust TMH endpoints by up to 3 residues in either direction. 25
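A sketch of the splitting rule, assuming a score(a, b) function that returns the refinement network's score for a helix spanning residues a..b (the function name and the inclusive coordinates are assumptions):

```python
def maybe_split_long_tmh(start, end, score, min_len=17, long_len=35):
    """If a predicted TMH is at least long_len residues, try every cut
    that leaves two helices of at least min_len residues and keep the
    split whose two halves have a higher average score than the
    unsplit helix."""
    length = end - start + 1
    if length < long_len:
        return [(start, end)]
    best_cut, best_avg = None, score(start, end)
    for cut in range(start + min_len - 1, end - min_len + 1):
        avg = (score(start, cut) + score(cut + 1, end)) / 2.0
        if avg > best_avg:
            best_cut, best_avg = cut, avg
    if best_cut is None:
        return [(start, end)]
    return [(start, best_cut), (best_cut + 1, end)]
```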

TMSEG overview Step 1-3 26

TMSEG step 4 - Topology prediction I: Random forest (T = 100, m = 7). Assign soluble segments to side 1 or side 2. Features: amino acid composition (side 1 / side 2, PSSM>0 / PSSM<0) 2*2*20, % positive charge (side 1 / side 2, PSSM>0 / PSSM<0) 2*2*1, absolute difference in % positive charge between side 1 and side 2 (PSSM>0 / PSSM<0) 2*1. 27

TMSEG step 4 - Topology prediction II: Consider only residues close to TMHs (15 residues next to each TMH and 8 residues into each TMH). Predict the topology of the N-terminus and extrapolate. If an SP is predicted → residues after the SP are outside. 28
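A sketch of the extrapolation step: fix the side of the N-terminal segment and flip it after every TMH. The 'i'/'o'/'H' encoding and the simplified signal-peptide handling are assumptions for illustration.

```python
def extrapolate_topology(n_residues, tmh_segments, n_term_inside, has_sp=False):
    """Assign 'i' (inside) / 'o' (outside) to all non-TMH residues by
    fixing the N-terminal side and flipping it after every TMH. If a
    signal peptide is predicted, the chain is assumed to start on the
    outside (the SP segment itself is not modelled in this sketch)."""
    topology = ["H"] * n_residues          # 'H' marks TMH residues
    side = "o" if has_sp else ("i" if n_term_inside else "o")
    pos = 0
    for start, end in sorted(tmh_segments):
        for k in range(pos, start):
            topology[k] = side
        side = "i" if side == "o" else "o"  # crossing a TMH flips the side
        pos = end + 1
    for k in range(pos, n_residues):
        topology[k] = side
    return "".join(topology)

# Example: two TMHs at residues 10-29 and 45-64 of an 80-residue protein,
# N-terminus predicted inside:
# extrapolate_topology(80, [(10, 29), (45, 64)], n_term_inside=True)
```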

TMSEG overview Step 1-4 29

Performance measures I: Per-residue measures are often misleading → score by TMH segments instead. Whole-protein scores: Q_ok and Q_top. 30

Performance measures II: r_i = 100 * (# correctly predicted TMHs) / (# observed TMHs). p_i = 100 * (# correctly predicted TMHs) / (# predicted TMHs). Q_ok = (100 / N) * Σ_{i=1..N} x_i, where x_i = 1 if p_i = r_i = 100%, and x_i = 0 otherwise. 31

Performance measures III: What is a correctly predicted TMH? Strict criteria: endpoint deviation of at most 5 residues, overlap of at least 50%. 32

Performance measures IV: t_i = 100% if the topology is correct, otherwise 0%. Q_top = (100 / N) * Σ_{i=1..N} y_i, where y_i = 1 if t_i = p_i = r_i = 100%, and y_i = 0 otherwise. 33
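A sketch of how Q_ok could be computed from predicted and observed helix segments under the strict criteria of the previous slide. The helix matching here is deliberately simplified (it does not enforce a one-to-one pairing), so it only illustrates the idea; Q_top would additionally require t_i = 100%.

```python
def helix_correct(pred, obs, max_end_dev=5, min_overlap=0.5):
    """A predicted TMH counts as correct if both endpoints deviate by at
    most max_end_dev residues and it covers at least min_overlap of the
    observed helix (the strict criteria, as interpreted here)."""
    (ps, pe), (os_, oe) = pred, obs
    overlap = max(0, min(pe, oe) - max(ps, os_) + 1)
    return (abs(ps - os_) <= max_end_dev and abs(pe - oe) <= max_end_dev
            and overlap >= min_overlap * (oe - os_ + 1))

def q_ok(proteins):
    """Q_ok = 100/N * number of proteins for which every observed TMH is
    predicted correctly and no extra TMH is predicted (p_i = r_i = 100%).

    proteins: list of (predicted_helices, observed_helices) pairs, each
    helix given as an inclusive (start, end) tuple."""
    ok = 0
    for pred_helices, obs_helices in proteins:
        correct = sum(any(helix_correct(p, o) for o in obs_helices)
                      for p in pred_helices)
        r_full = correct == len(obs_helices)
        p_full = correct == len(pred_helices)
        ok += int(r_full and p_full and bool(obs_helices))
    return 100.0 * ok / len(proteins)
```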

Performance of TMH predictions 34

Performance measures - TMP classification: FPR = 100 * (# incorrectly predicted TMPs) / (# soluble proteins). Sensitivity = 100 * (# correctly predicted TMPs) / (# observed TMPs). Compare to a simple predictor ("Baseline") that uses only a hydrophobicity scale and the positive-inside rule. 35

TMP classification - Very low misclassification rates 36
Method      | TMP sensitivity | TMP FPR | Topology correct | Misclassified in human | More mistakes than TMSEG in human
TMSEG       | 98 ± 2          | 3 ± 1   | 93 ± 4           | 558                    | -
PolyPhobius | 100 ± 0         | 5 ± 1   | 78 ± 7           | 770                    | 212
MEMSAT3     | 100 ± 0         | 28 ± 2  | 93 ± 4           | 4,313                  | 3,755
MEMSAT-SVM  | 98 ± 2          | 14 ± 2  | 88 ± 5           | 2,253                  | 1,695
Baseline    | 95 ± 3          | 31 ± 2  | 75 ± 7           | 5,015                  | 4,457

Dataset of 12 new proteins: How do we get more data? Use what was published after the work was started → data unknown to any method. From 07/2013 to 02/2016 only 12 new TMPs were published: a very small dataset. TMSEG predicts every TMH of the 10 recognized TMPs. 37

Applying TMSEG to other methods I: High modularity (steps 1-4). Apply steps 3 and 4 to other methods: step 3 = NN-based TMH prediction improvement, step 4 = RF-based topology prediction. Can this improve other methods? 38

Applying TMSEG to other methods II 39

Potential extensions: Re-entrant regions are not modelled (little data). Idea: check abnormal TMH segments for re-entrant regions; does not switching the topology increase the scores? 40

Availability: Debian package: http://rostlab.org/debian/pool/main/t/tmseg/ ; GitHub: github.com/rostlab/tmseg ; PredictProtein: predictprotein.org (Yachdav et al., 2014, NAR). 41

Thank you 42

References 43
Yachdav, G., Kloppmann, E., Kajan, L., Hecht, M., Goldberg, T., Hamp, T., & Rost, B. (2014). PredictProtein - an open resource for online prediction of protein structural and functional features. Nucleic Acids Research, 42(Web Server issue), W337-W343. http://doi.org/10.1093/nar/gku366
Jensen, L. J., & Bateman, A. (2011). The rise and fall of supervised machine learning techniques. Bioinformatics, 27(24), 3331-3332. http://doi.org/10.1093/bioinformatics/btr585
Loh, W.-Y. (2011). Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1), 14-23. http://doi.org/10.1002/widm.8
Lomize, M. A., Lomize, A. L., Pogozheva, I. D., & Mosberg, H. I. (2006). OPM: orientations of proteins in membranes database. Bioinformatics, 22(5), 623-625. http://doi.org/10.1093/bioinformatics/btk023
Velankar, S., McNeil, P., Mittard-Runte, V., Suarez, A., Barrell, D., Apweiler, R., & Henrick, K. (2005). E-MSD: an integrated data resource for bioinformatics. Nucleic Acids Research, 33(Database issue), D262-D265. http://doi.org/10.1093/nar/gki058
Kozma, D., Simon, I., & Tusnády, G. E. (2013). PDBTM: Protein Data Bank of transmembrane proteins after 8 years. Nucleic Acids Research, 41(Database issue), D524-D529. http://doi.org/10.1093/nar/gks1169
Mika, S., & Rost, B. (2003). UniqueProt: creating representative protein sequence sets. Nucleic Acids Research, 31(13), 3789-3791. http://doi.org/10.1093/nar/gkg620