IT og Sundhed 2010/11

Sequence-based predictors: secondary structure and surface accessibility. Bent Petersen, 13 January 2011 1

NetSurfP: real-value solvent accessibility predictions with amino-acid-associated reliability 2

Objective
- Predict residues as being either buried or exposed (25 % threshold): two states/classes, Buried/Exposed
- Predict the Relative Solvent Accessibility (RSA) as a real value
3

What is ASA?
Accessible Surface Area, measured in Å²: the surface area accessible to a rolling water molecule.
4

RSA
$$\mathrm{RSA} = \frac{\mathrm{ACC}_{\text{protein}}}{\mathrm{ASA}_{\text{tripeptide}}}$$
- RSA = Relative Solvent Accessibility
- ACC = Accessible area in the protein structure
- ASA = Accessible Surface Area in a Gly-X-Gly or Ala-X-Ala tripeptide
Two kinds of networks: classification networks and real-value networks
- Classification: Buried = RSA < 25 %, Exposed = RSA > 25 %
- Real value: values 0-1; RSA > 1 is set to 1
5
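The RSA definition and the two-state threshold can be sketched in a few lines of Python. This is a minimal illustration, not NetSurfP's code: the function names are hypothetical, the numbers in the usage lines are made up, and treating an RSA of exactly 25 % as exposed is an assumption, since the slide leaves that case open.

```python
def relative_solvent_accessibility(acc, max_asa):
    """Return RSA = ACC / ASA_tripeptide, with values above 1 set to 1."""
    if max_asa <= 0:
        raise ValueError("max_asa must be positive")
    return min(acc / max_asa, 1.0)


def exposure_class(rsa, threshold=0.25):
    """Two-state label at the 25 % threshold.

    Assumption: an RSA exactly at the threshold is treated as exposed.
    """
    return "buried" if rsa < threshold else "exposed"


# Made-up numbers for illustration only (not real reference areas).
example_acc = 30.0       # accessible area of the residue in the protein, in Å^2
example_max_asa = 150.0  # area of the same residue in an extended tripeptide
rsa = relative_solvent_accessibility(example_acc, example_max_asa)
print(rsa, exposure_class(rsa))
```

A real pipeline would look up the tripeptide reference area per amino acid type; those reference values are not given on the slides and are therefore left as an input here.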

Why predict RSA?
Residues exposed on the surface can be:
- Involved in PTMs (post-translational modifications)
- Potential epitopes
- Involved in protein-protein interactions
- Useful for the prediction of disease SNPs
6

How to start?
What do we want?
- We want to be able to predict the exposure of an amino acid
What do we need?
- A training dataset and an independent evaluation dataset
What information do we need?
- True structural information that the neural network can train on
Where do we get that?
- PDB, DSSP
7

Protein Data Bank, PDB
Berman, H.M., et al., The Protein Data Bank. Nucl. Acids Res., 2000. 28(1): p. 235-242.
8

Define Secondary Structure of Proteins, DSSP
==== Secondary Structure Definition by the program DSSP, updated CMBI version by ElmK / April 1,2000 ==== DATE=23-MAR-2009.
REFERENCE W. KABSCH AND C.SANDER, BIOPOLYMERS 22 (1983) 2577-2637.
HEADER TOXIN 12-AUG-98 3BTA.
COMPND 2 MOLECULE: PROTEIN (BOTULINUM NEUROTOXIN TYPE A);.
SOURCE 2 ORGANISM_SCIENTIFIC: CLOSTRIDIUM BOTULINUM;.
AUTHOR R.C.STEVENS,D.B.LACY.
1277 2 2 1 1 TOTAL NUMBER OF RESIDUES, NUMBER OF CHAINS, NUMBER OF SS-BRIDGES(TOTAL,INTRACHAIN,INTERCHAIN).
55121.0 ACCESSIBLE SURFACE OF PROTEIN (ANGSTROM**2).
815 63.8 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(J), SAME NUMBER PER 100 RESIDUES.
24 1.9 TOTAL NUMBER OF HYDROGEN BONDS IN PARALLEL BRIDGES, SAME NUMBER PER 100 RESIDUES.
198 15.5 TOTAL NUMBER OF HYDROGEN BONDS IN ANTIPARALLEL BRIDGES, SAME NUMBER PER 100 RESIDUES.
1 0.1 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I-5), SAME NUMBER PER 100 RESIDUES.
10 0.8 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I-4), SAME NUMBER PER 100 RESIDUES.
125 9.8 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+2), SAME NUMBER PER 100 RESIDUES.
134 10.5 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+3), SAME NUMBER PER 100 RESIDUES.
276 21.6 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+4), SAME NUMBER PER 100 RESIDUES.
9 0.7 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+5), SAME NUMBER PER 100 RESIDUES.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 *** HISTOGRAMS OF ***.
0 0 0 0 0 3 3 1 2 1 0 3 1 1 0 1 0 0 1 0 1 0 1 1 0 0 0 0 0 2 RESIDUES PER ALPHA HELIX.
2 0 1 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PARALLEL BRIDGES PER LADDER.
15 10 7 5 8 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ANTIPARALLEL BRIDGES PER LADDER.
3 3 0 0 1 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 LADDERS PER SHEET.
# RESIDUE AA STRUCTURE BP1 BP2 ACC N-H-->O O-->H-N N-H-->O O-->H-N TCO KAPPA ALPHA PHI PSI X-CA Y-CA Z-CA
1 1 A P 0 0 5 0, 0.0 2,-3.8 0, 0.0 3,-0.2 0.000 360.0 360.0 360.0 132.0 74.7 55.7 73.4
2 2 A F - 0 0 115 92,-0.4 93,-0.1 1,-0.1 36,-0.1 -0.206 360.0 -142.1 55.7 -62.1 74.7 59.2 74.7
3 3 A V - 0 0 11 -2,-3.8 35,-0.2 91,-0.1 -1,-0.1 0.867 4.9 -143.8 70.2 103.3 78.3 59.8 73.7
4 4 A N S S+ 0 0 127 33,-0.3 2,-0.5 -3,-0.2 33,-0.1 0.914 73.7 44.0 -67.5 -53.8 80.1 61.9 76.4
5 5 A K S S- 0 0 94 32,-0.1 2,-0.5 1,-0.0 -1,-0.1 -0.857 79.6 -124.0 -105.1 133.1 82.5 64.2 74.5
6 6 A Q - 0 0 192 -2,-0.5 2,-0.1 1,-0.1 82,-0.1 -0.568 35.9 -150.4 -71.8 118.5 81.6 66.2 71.4
7 7 A F - 0 0 14 -2,-0.5 2,-0.3 80,-0.1 3,-0.1 -0.388 16.9 -164.3 -91.4 166.8 84.2 65.3 68.7
8 8 A N > - 0 0 71 -2,-0.1 3,-0.9 1,-0.1 77,-0.0 -0.977 28.9 -124.4 -143.4 141.5 85.7 67.1 65.7
9 9 A Y T 3 S+ 0 0 17 -2,-0.3 -1,-0.1 1,-0.2 72,-0.1 0.908 109.3 50.7 -57.8 -43.3 87.5 65.3 62.9
10 10 A K T 3 S+ 0 0 141 -3,-0.1 -1,-0.2 70,-0.1 3,-0.1 0.650 77.9 122.5 -70.3 -17.2 90.7 67.4 63.3
11 11 A D S < S- 0 0 45 -3,-0.9 3,-0.1 1,-0.1 2,-0.1 -0.203 77.6 -91.4 -48.0 134.3 91.0 66.8 67.1
12 12 A P - 0 0 99 0, 0.0 -1,-0.1 0, 0.0 -2,-0.1 -0.246 38.0 -108.3 -57.6 128.3 94.4 65.3 67.8
13 13 A V + 0 0 41 -3,-0.1 6,-0.2 1,-0.1 4,-0.1 -0.238 38.6 179.2 -51.8 138.5 94.8 61.5 67.8
14 14 A N - 0 0 67 4,-3.7 2,-1.4 2,-0.2 5,-0.2 -0.085 45.1 -107.4 -144.3 45.7 95.4 60.3 71.4
15 15 A G S S+ 0 0 0 122,-0.4 2,-0.3 3,-0.2 4,-0.2 0.248 100.3 58.5 54.3 -18.1 95.7 56.6 71.7
16 16 A V S S- 0 0 72 -2,-1.4 -2,-0.2 2,-0.5 20,-0.1 -0.996 116.3 -7.4 -142.5 145.9 92.2 56.3 73.3
17 17 A D S S+ 0 0 22 -2,-0.3 19,-2.5 18,-0.1 2,-0.2 0.389 136.6 45.3 53.3 -7.2 88.7 57.3 72.3
18 18 A I E S+A 35 0A 6 17,-0.3 -4,-3.7 -11,-0.0 -2,-0.5 -0.649 85.9 128.7 -161.1 96.3 90.4 59.0 69.2
Kabsch, W. and C. Sander, Dictionary of Protein Secondary Structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 1983. 22(12): p. 2577-2637.
9
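To turn a DSSP file like the one above into training labels, the ACC and secondary structure columns have to be read from each residue record. The sketch below is an illustration under stated assumptions, not part of NetSurfP: the function names are hypothetical, and the fixed-width column offsets follow the classic DSSP layout as commonly parsed, so they should be checked against the DSSP version in use. The transcript above has lost the original column spacing, so the example lines are rebuilt in fixed-width form from the values of residues 4 and 18.

```python
def parse_dssp_residue_line(line):
    """Extract residue number, chain, amino acid, SS class and ACC.

    Assumption: classic fixed-width DSSP columns (0-based slices below).
    """
    return {
        "resnum": int(line[5:10]),
        "chain": line[11],
        "aa": line[13],
        "ss": line[16],          # blank means no assigned class (coil)
        "acc": int(line[34:38]),
    }


def build_example_line(seq_no, resnum, chain, aa, ss, acc):
    """Rebuild a minimal fixed-width record for demonstration purposes."""
    prefix = f"{seq_no:>5}{resnum:>5} {chain} {aa}  {ss}"
    return prefix.ljust(34) + f"{acc:>4}"


# Values taken from residues 4 and 18 of the 3BTA listing above.
for record in (build_example_line(4, 4, "A", "N", "S", 127),
               build_example_line(18, 18, "A", "I", "E", 6)):
    print(parse_dssp_residue_line(record))
```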

Define Secondary Structure of Proteins, DSSP
DSSP defines 8 types of secondary structure:
- G = 3-turn helix (3-10 helix)
- H = 4-turn helix (α-helix)
- I = 5-turn helix (π-helix)
- T = Hydrogen-bonded turn (3, 4 or 5 turn)
- E = Extended strand
- B = Residue in isolated β-bridge
- S = Bend
- Rest is C = coil
10

Required datasets
Training/test
- Used for optimization of settings using 10-fold cross-validation
Evaluation
- Used for final evaluation; less than 25 % homology to the training/test dataset
11

10-fold Cross Validation
- Break the dataset into 10 sets, each 1/10 of the data
- Train on 9 sets and test on the remaining 1
- Repeat 10 times and take the mean accuracy
12
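The three steps of 10-fold cross-validation can be sketched as follows. This is a generic illustration, not the training code used for NetSurfP: `ten_fold_splits`, `cross_validated_accuracy` and the `evaluate` callback are hypothetical names, and the random shuffling with a fixed seed is an assumption about how the folds are formed.

```python
import random


def ten_fold_splits(items, k=10, seed=0):
    """Yield (train, test) pairs; each item is in the test set exactly once."""
    items = list(items)
    random.Random(seed).shuffle(items)          # assumption: random fold assignment
    folds = [items[i::k] for i in range(k)]     # k folds of (nearly) equal size
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def cross_validated_accuracy(items, evaluate, k=10):
    """Mean of evaluate(train, test) over the k folds.

    `evaluate` is a hypothetical callback that trains on `train`
    and returns the accuracy measured on `test`.
    """
    scores = [evaluate(train, test) for train, test in ten_fold_splits(items, k)]
    return sum(scores) / len(scores)


splits = list(ten_fold_splits(range(100)))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

In practice the items would be whole protein sequences rather than single residues, so that residues from one protein never appear in both the training and the test fold.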

Learning / Training dataset
Training set: Cull_1764
- Max. sequence identity: 25 %
- Resolution: 2.0 Å
- R-factor: 0.2
- Sequence length: 30-3000 amino acids
- X-ray entries only
13

PISCES 14

Learning / Training dataset
Homology reduced towards evaluation set CB513 (302 sequences removed)
Final training set:
- 1764 sequences
- 417,978 amino acids
- Buried: 55.80 % (233,221 amino acids)
- Exposed: 44.20 % (184,757 amino acids)
15

Learning / Training dataset
---Sequence/residue statistics---
Number of seq.: 1764
Longest seq.: 1T3T.A (1283)
Shortest seq.: 1YTV.M (6)
Number of amino acids: 417978
---Assignment category statistics---
B 184757 (44.20%)
A 233221 (55.80%)
---Amino acid statistics---
H 10025 (2.40%)
G 31743 (7.59%)
Y 14927 (3.57%)
V 30171 (7.22%)
E 27774 (6.64%)
S 24430 (5.84%)
P 19589 (4.69%)
A 35658 (8.53%)
R 21435 (5.13%)
Q 15535 (3.72%)
C 5202 (1.24%)
K 23054 (5.52%)
L 38489 (9.21%)
N 17756 (4.25%)
T 22998 (5.50%)
F 17181 (4.11%)
D 24743 (5.92%)
I 23550 (5.63%)
W 6365 (1.52%)
M 7353 (1.76%)
16

Evaluation dataset
Final evaluation dataset: CB513
- 513 non-homologous sequences
- Sequence length: 20-754 amino acids
- 84,119 amino acids
- Buried: 55.81 % (46,948 amino acids)
- Exposed: 44.19 % (37,171 amino acids)
17

Evaluation dataset
---Sequence/residue statistics---
Number of seq.: 513
Longest seq.: 6acn.all (754)
Shortest seq.: 1atpi-1 (20)
Number of amino acids: 84119
---Assignment category statistics---
B 37171 (44.19%)
A 46948 (55.81%)
---Amino acid statistics---
R 3812 (4.53%)
T 5015 (5.96%)
D 4973 (5.91%)
C 1381 (1.64%)
Y 3065 (3.64%)
G 6657 (7.91%)
N 3976 (4.73%)
V 5795 (6.89%)
I 4642 (5.52%)
A 7267 (8.64%)
S 5222 (6.21%)
K 4976 (5.92%)
P 3903 (4.64%)
E 5050 (6.00%)
L 7134 (8.48%)
Q 3108 (3.69%)
M 1710 (2.03%)
H 1865 (2.22%)
W 1236 (1.47%)
F 3268 (3.88%)
X 19 (0.02%)
B 31 (0.04%)
Z 14 (0.02%)
18

Amino acid distribution (%), Cull/Learning vs. CB513
Amino acid   Cull/Learning   CB513
A            8.53            8.64
C            1.24            1.64
D            5.92            5.91
E            6.64            6.00
F            4.11            3.88
G            7.59            7.91
H            2.40            2.22
I            5.63            5.52
K            5.52            5.92
L            9.21            8.48
M            1.76            2.03
N            4.25            4.73
P            4.69            4.64
Q            3.72            3.69
R            5.13            4.53
S            5.84            6.21
T            5.50            5.96
V            7.22            6.89
W            1.52            1.47
Y            3.57            3.64
19

Neural Network - Input
Position Specific Scoring Matrices, PSSM
               A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
B H 2BEM.A 1  -4  -3  -2  -4  -6  -2  -3  -5  11  -6  -5  -3  -4  -4  -5  -3  -4  -5  -1  -6
A G 2BEM.A 2  -2  -5  -3  -4  -5  -4  -5   7  -5  -7  -6  -4  -5  -6  -5  -3  -4  -5  -6  -6
A Y 2BEM.A 3  -1   1  -4  -3  -5  -4  -4  -4   1  -4  -1  -4  -1   2  -5   0  -1   4   7  -2
A V 2BEM.A 4  -1  -5  -5  -6  -4  -4  -5  -5  -5   4   1  -5   6  -3  -2  -2   0  -5  -4   4
B E 2BEM.A 5  -2  -4  -3   0  -4  -1   3  -2  -4   0  -3  -2   1  -2  -3   3   3  -5  -4   0
Four iterations of PSI-BLAST against nr70
Secondary structure predictions
B H 2BEM.A 1  0.003  0.003  0.966
A G 2BEM.A 2  0.018  0.086  0.868
A Y 2BEM.A 3  0.020  0.199  0.752
A V 2BEM.A 4  0.021  0.271  0.679
B E 2BEM.A 5  0.020  0.199  0.752
(secondary structure predictor by Pernille Andersen)
20

Secondary structure predictor
- Developed by Pernille Andersen, incorporated in NetSurfP
- Trained on 2,085 sequences using DSSP
- DSSP classes reduced to three states: H = H, E = E, C = ., G, I, B, S and T
- H 30 %, E 20 %, C 50 %
- Performance of ~80 %
- Maximum theoretical limit is ~88 %
21
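The reduction from DSSP's eight classes to the three states H, E and C stated on this slide amounts to a simple lookup. A minimal sketch, with a hypothetical function name and an invented input string:

```python
def reduce_to_three_states(dssp_string):
    """Map DSSP classes to three states: H stays H, E stays E, the rest is C."""
    mapping = {"H": "H", "E": "E"}
    # '.', G, I, B, S, T (and anything else) fall through to coil.
    return "".join(mapping.get(symbol, "C") for symbol in dssp_string)


# Invented eight-class string, used only to exercise every class once.
print(reduce_to_three_states("HGIEBTS."))
```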

Neural Network - Settings
- Window size: 11-19
- Hidden units: 10, 20, 25, 30, 40, 50, 75, 150, (200)
- Learning rate: 0.01 / (0.005)
- Epochs (training rounds): 200
- 10-fold cross-validation: 9/10 used for training, 1/10 for testing
22
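The settings above describe a grid of network architectures. The sketch below enumerates one plausible reading of that grid; it assumes that only odd window sizes between 11 and 19 are used (so that each window has a middle residue) and that the parenthesised value of 200 hidden units is left out. Under those assumptions the grid has 40 architectures, and with 10 cross-validation folds that gives 400 networks, which is consistent with, though not confirmed by, the "All Architectures" count reported in the results.

```python
from itertools import product

# Assumption: odd window sizes only, so every window has a middle residue.
WINDOW_SIZES = [11, 13, 15, 17, 19]
# Assumption: the parenthesised 200 hidden units is not part of the main grid.
HIDDEN_UNITS = [10, 20, 25, 30, 40, 50, 75, 150]
FOLDS = 10

architectures = list(product(WINDOW_SIZES, HIDDEN_UNITS))
print(len(architectures))          # number of (window, hidden units) pairs
print(len(architectures) * FOLDS)  # one network per architecture and fold
```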

Neural network window
Sliding window of 7
170 2BEM.A mol:aa CHITIN-BINDING PROTEIN
HGYVESPASRAYQCKLQLNTQCGSVQYEPQSVEGLKGFPQAGPADGHIASADKSTFFELDQQTPTRWNKLNLKTGPNSFT
WKLTARHSTTSWRYFITKPNWDASQPLTRASFDLTPFCQFNDGGAIPAAQVTHQCNIPADRSGSHVILAVWDIADTANAF
YQAIDVNLSK
BAAABBAAAAAAAABBBBABBABBAABBABAABABBBAABBBABBABAAAABBBBABAAABABBBAABABBABAABABAA
ABABBBBAABAAAAAAABBBABABBBAAABAABBBAAAAAABBBBBABBBABABABAABBABBBAAAAAAAAABBBBBAA
AAAAAABABB
Prediction on the middle residue of the window, as the window slides along the sequence:
- Serine, buried
- Proline, exposed
- Alanine, exposed
23-25
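The sliding window can be sketched directly on the first ten residues of 2BEM.A and their labels from the slide. This is an illustration only: the function name is hypothetical, and padding the sequence ends with "-" so that every residue gets a full window is an assumption, since the slides do not say how the termini are handled.

```python
def sliding_windows(sequence, window_size=7, pad="-"):
    """Return one window per residue, centred on that residue.

    Assumption: sequence ends are padded with `pad` to keep windows full-length.
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd to have a middle residue")
    half = window_size // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + window_size] for i in range(len(sequence))]


sequence = "HGYVESPASR"   # first ten residues of 2BEM.A
labels = "BAAABBAAAA"     # matching labels from the slide
for window, residue, label in zip(sliding_windows(sequence), sequence, labels):
    print(window, residue, label)
```

The sixth window is `YVESPAS`, centred on the serine that the slide reports as buried (label `B`); the next two windows are centred on the proline and the alanine reported as exposed (label `A`).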

Method 26

Wisdom of the crowd
- Selecting the best-performing network architectures based on test performance
- Better than choosing any single network
Figure: 10-fold % correct predictions, average of sets A-J, with secondary structure
Architectures averaged   % correct
Top 1                    79.55
Top 2                    79.66
Top 3                    79.69
Top 4                    79.72
Top 5                    79.75
Top 6                    79.75
Top 7                    79.75
Top 8                    79.74
Top 9                    79.75
Top 10                   79.76
Top 11                   79.77
Top 12                   79.77
Top 13                   79.76
Top 14                   79.75
Top 15                   79.76
Top 16                   79.75
Top 17                   79.75
Top 18                   79.76
Top 19                   79.77
Top 20                   79.77
27
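The idea of ranking networks by test performance and averaging the best ones can be sketched as below. This is a generic ensemble-averaging illustration, not NetSurfP's implementation: the function name, the data layout (a test score paired with a list of per-residue outputs) and the toy numbers are all assumptions.

```python
def average_top_n(networks, n):
    """Average the outputs of the n networks with the best test scores.

    `networks` is a list of (test_score, predictions) pairs, where
    `predictions` holds one output value per residue (hypothetical layout).
    """
    if n < 1 or not networks:
        raise ValueError("need at least one network and n >= 1")
    ranked = sorted(networks, key=lambda item: item[0], reverse=True)[:n]
    length = len(ranked[0][1])
    return [sum(pred[i] for _, pred in ranked) / len(ranked)
            for i in range(length)]


# Toy scores and outputs, invented for illustration.
toy_networks = [
    (0.79, [0.2, 0.8]),
    (0.80, [0.4, 0.6]),
    (0.70, [1.0, 0.0]),
]
print(average_top_n(toy_networks, 2))
```

The weakest of the three toy networks is excluded when `n` is 2, which mirrors the slide's point that a selected group of good networks is preferable to any single one.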

Results - Classification networks
Training:
                           % Correct   MCC     #Networks
Best Single Architecture   79.5        0.587   10
All Architectures          79.7        0.592   400
Top 20 Architectures       79.8        0.593   200
28

Results - Classification networks
Evaluation:
                       % Correct   MCC
Dor and Zhou           78.8        Not published
NetSurfP CB500/CB513   79.0        0.577
30
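The two performance measures in these tables, % correct and the Matthews correlation coefficient (MCC), can both be computed from the four counts of a two-class confusion matrix. A minimal sketch using the standard MCC definition; the function names and the counts in the usage line are invented, and returning 0 when the denominator vanishes is a common convention adopted here as an assumption.

```python
import math


def percent_correct(tp, tn, fp, fn):
    """Percentage of residues assigned to the right class."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: return 0 when a row or column of the matrix is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Invented counts for illustration only.
print(percent_correct(8, 8, 2, 2), matthews_corrcoef(8, 8, 2, 2))
```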

Results Evaluation 31

NetSurfP
/usr/cbs/bio/src/netsurfp/netsurfp -h
32

NetSurfP 33

NetDiseaseSNP
Disease-SNP prediction (Morten Bo Johansen)
Without NetSurfP:
- Cross-validation: MCC = 0.569
- Cross-evaluation: MCC = 0.560
With NetSurfP:
- Cross-validation: MCC = 0.583
- Cross-evaluation: MCC = 0.572
34

Paper is out... what then? 35

Statistics: submissions to the web server from the CBS website 36

As of 12 Jan 2011: 136,003 sequences submitted from 13,494 unique IPs 43

First citation: 24 October 2009 :-) 44
