Protein bioinformatics. Åsa Björklund, CMB/LICR, asa.bjorklund@licr.ki.se

In this lecture: Protein structures and 3D structure prediction. Protein domains. HMMs. Protein networks. Protein function annotation/prediction.

Figure: Gene (ATGATCATGGTTACAGGT) -> mRNA (AUGAUCAUGGUUACAGGU) -> polypeptide (MAHRKYLI) -> folding -> 3D protein -> protein complex. Other labels in the figure: PTM, localization, degradation, mutations, selection. Structure is more conserved than sequence!

This lecture: Protein structures and structure prediction. Protein domains. Motifs, profiles and HMMs. Some servers. Protein networks. Function prediction.

Protein sequence databases. UniProtKB: Swiss-Prot (quality), PIR-PSD, TrEMBL (quantity), UniMES (metagenomics samples). Entrez Protein (NCBI): coding regions from GenBank and Swiss-Prot, PIR, PDB, etc.; RefSeq. Non-redundant databases: UniRef100, UniRef90 etc.; NCBI nr.

Protein structure databases (Gutmanas et al. 2013). 1. Soft X-ray tomogram of a fission yeast cell. 2. Electron tomogram of ribosomes in the cytosol. 3. Cryo-EM reconstruction of the 80S ribosome from yeast. 4. Crystal structure of the 50S ribosomal subunit (PDB entry 3uzk). 5. Crystal structure revealing how tmRNA and the small protein SmpB enable the kirromycin-stalled 70S ribosome to proceed with translation (PDB entries 4abr and 4abs).

PDB: Protein Data Bank, http://www.rcsb.org/pdb. Main database of three-dimensional protein structures. Also contains structures of other macromolecules (DNA, RNA, carbohydrates). PDB entry format resembles EMBL. Currently (May 2013) 90,424 entries.
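PDB entries are flat text files with fixed-width records. As a minimal sketch, the coordinates of one atom can be read by slicing an ATOM line at the column positions of the PDB format. The record below is made up for illustration, and the `Atom` type and `parse_atom_line` function are names assumed here, not part of any PDB tool.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse one fixed-width ATOM/HETATM record (0-based slices of 1-based columns)."""
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain=line[21],                # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
    )


# Made-up record, built with a format string so that the columns line up.
example = "ATOM  {:5d} {:<4}{}{:>3} {}{:4d}{}   {:8.3f}{:8.3f}{:8.3f}".format(
    1, " CA", " ", "ALA", "A", 12, " ", 11.104, 6.134, -6.504
)
atom = parse_atom_line(example)
print(atom)
```

For real work a maintained parser is the safer choice; the point here is only that the format can be read column by column.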

Viewing protein structures: PyMOL, Jmol, RasMol. Download PDB file, open in viewer, then select, color, rotate.

Structure prediction. De novo folding: molecular dynamics; based on energy minimization, depends on starting structure; works well for small peptides, not feasible for very large proteins; Folding@home. Homology modelling: template from homologous structure(s); refinement based on energy minimization; SWISS-MODEL, 3D-JIGSAW. Consensus methods: combine predictions from several servers; Pcons.net.
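Why energy minimization depends on the starting structure can be shown with a toy example. The sketch below runs plain gradient descent on a one-dimensional double-well function standing in for an energy landscape. The function, step size and starting points are all assumptions chosen for illustration; real force fields have thousands of coordinates.

```python
def energy(x: float) -> float:
    """Toy energy landscape with two minima of different depth."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def gradient(x: float) -> float:
    """Derivative of the toy energy with respect to x."""
    return 4.0 * x * (x * x - 1.0) + 0.3


def minimize(start: float, step: float = 0.01, n_steps: int = 2000) -> float:
    """Follow the negative gradient downhill from the given start."""
    x = start
    for _ in range(n_steps):
        x -= step * gradient(x)
    return x


left = minimize(-2.0)   # ends in the deeper minimum
right = minimize(2.0)   # gets stuck in the shallower, local minimum
print(left, energy(left))
print(right, energy(right))
```

Both runs converge, but only the one started on the left reaches the lower-energy minimum, which is the reason a good template or starting model matters.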

Docking: Predict interactions between proteins, or between protein and ligand. Most proteins change conformation upon binding. Ligand docking is commonly used in drug design. HADDOCK, PatchDock, ClusPro.

Protein domains

Protein domains: Defined as an independent folding unit or an independently evolving unit. Each domain has a characteristic structure and/or function conserved in evolution. Domains are often combined to create multi-domain proteins. Often grouped into families and superfamilies. Known domains in ~80% of all proteins, covering ~58% of the residues.
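The two coverage figures measure different things: the fraction of proteins with at least one known domain, and the fraction of residues inside a domain. A minimal sketch of both calculations follows, using made-up protein lengths and domain coordinates (1-based, inclusive); the names `proteins` and `covered_residues` are assumed for illustration.

```python
def covered_residues(domains):
    """Count residues covered by at least one (start, end) domain, overlaps counted once."""
    covered = 0
    current_end = 0
    for start, end in sorted(domains):
        start = max(start, current_end + 1)  # skip the part already counted
        if end >= start:
            covered += end - start + 1
            current_end = end
    return covered


# Hypothetical data: protein name -> (length, list of domain hits)
proteins = {
    "protA": (300, [(10, 120), (100, 180)]),
    "protB": (150, []),
    "protC": (200, [(1, 50)]),
}

with_domain = sum(1 for _, hits in proteins.values() if hits)
protein_fraction = with_domain / len(proteins)

total_length = sum(length for length, _ in proteins.values())
total_covered = sum(covered_residues(hits) for _, hits in proteins.values())
residue_fraction = total_covered / total_length

print(f"proteins with a domain: {protein_fraction:.0%}")
print(f"residues in a domain:   {residue_fraction:.0%}")
```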


Protein Structure Classification Databases, ordered from QUALITY to QUANTITY. SCOP: all manual. CATH: semi-automatic. FSSP: all automatic. ENTREZ: all automatic.

Structural Classification Of Proteins (SCOP), http://scop.mrc-lmb.cam.ac.uk/scop/. Attempt to classify all proteins in PDB according to structural and evolutionary relationships. Hierarchical classification system. Classification based on human expertise. SUPERFAMILY contains HMMs for all SCOP families (http://supfam.cs.bris.ac.uk/).

SCOP hierarchy (Class, Fold, Superfamily, Family). Class: main secondary structure elements. Examples: all-alpha, all-beta, alpha/beta.

SCOP hierarchy (Class, Fold, Superfamily, Family). Fold: arrangements of secondary structure elements. Examples: globin-like, prion-like, alpha-beta knot.

SCOP hierarchy (Class, Fold, Superfamily, Family). Superfamily: low sequence similarity but conserved structure and/or function.

SCOP hierarchy (Class, Fold, Superfamily, Family). Family: significant sequence similarity and similar structure and function.
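The four-level hierarchy maps naturally onto a dotted identifier. The sketch below assumes identifiers of the form `class.fold.superfamily.family` (for example `a.1.1.2`), similar to the concise classification strings used by SCOP, and reports the deepest level two domains share. The identifiers used are illustrative, and the function names are assumptions.

```python
LEVELS = ("class", "fold", "superfamily", "family")


def parse_sccs(sccs):
    """Split 'a.1.1.2' into a prefix for each level of the hierarchy."""
    parts = sccs.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected four dot-separated fields, got {sccs!r}")
    return {level: ".".join(parts[: i + 1]) for i, level in enumerate(LEVELS)}


def deepest_shared_level(a, b):
    """Return the most specific level at which two domains are classified together."""
    pa, pb = parse_sccs(a), parse_sccs(b)
    shared = None
    for level in LEVELS:
        if pa[level] != pb[level]:
            break
        shared = level
    return shared


print(parse_sccs("a.1.1.2"))
print(deepest_shared_level("a.1.1.2", "a.1.1.3"))  # same superfamily, different family
print(deepest_shared_level("a.1.1.2", "b.1.1.1"))  # nothing shared
```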

Domains based on sequence conservation. Pfam (pfam.sanger.ac.uk): Pfam-A is manually curated, Pfam-B is automated clustering, Pfam clans group families into superfamilies. SMART (smart.embl-heidelberg.de): manually curated, specializes in signalling, extracellular and chromatin-associated proteins. ProDom (prodom.prabi.fr): automated clustering.

Other motifs: DNA/RNA/protein binding motifs. Transmembrane helices (TMH). Signal peptides. Posttranslational modification (PTM) signals. Secondary structure. Disordered regions.

Hidden Markov Models (HMMs): Different states, with different probabilities of each symbol (aa or nt) at each state. Transition probabilities between states. Insert states. Silent states.
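States, emission probabilities and transition probabilities are enough to compute how likely a sequence is under a model. The sketch below uses a made-up two-state nucleotide HMM and the forward algorithm to sum over all state paths. All probabilities are invented for illustration, and insert and silent states are left out to keep it short.

```python
STATES = ("AT_rich", "GC_rich")

# Made-up parameters: each row of probabilities sums to 1.
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANSITION = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9},
}
EMISSION = {
    "AT_rich": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "GC_rich": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}


def forward(seq):
    """Probability of the sequence, summed over every possible state path."""
    f = {s: START[s] * EMISSION[s][seq[0]] for s in STATES}
    for symbol in seq[1:]:
        f = {
            s: EMISSION[s][symbol] * sum(f[p] * TRANSITION[p][s] for p in STATES)
            for s in STATES
        }
    return sum(f.values())


print(forward("AATT"))  # fits one state well, so relatively likely
print(forward("AGTC"))  # needs frequent state switches, so less likely
```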

HMM models: key concepts - No magic involved: just an extension of the profile - Enables modelling of deletions and insertions - Very useful for protein domains, HMMs for many different domain databases such as SCOP, Pfam etc. are available for download or web-based searches - Common programs to build HMMs from MSAs and scan sequence databases for matches are HMMER and SAM.
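Since an HMM is described here as an extension of the profile, it may help to see the profile itself. This is a minimal sketch that builds position-specific log-odds scores from a tiny ungapped alignment, with a pseudocount and a uniform background. The alignment, the pseudocount and the uniform background are assumptions, and gaps (the part HMMs add) are not handled.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(msa, pseudocount=1.0):
    """One dict of log2 odds scores per alignment column."""
    background = 1.0 / len(ALPHABET)
    profile = []
    for col in range(len(msa[0])):
        counts = {aa: pseudocount for aa in ALPHABET}
        for seq in msa:
            counts[seq[col]] += 1
        total = sum(counts.values())
        profile.append(
            {aa: math.log2((counts[aa] / total) / background) for aa in ALPHABET}
        )
    return profile


def score(profile, seq):
    """Sum of the column scores for a sequence of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(profile, seq))


# Made-up ungapped alignment of four short sequences.
msa = ["MKLV", "MKIV", "MRLV", "MKLI"]
profile = build_profile(msa)

print(score(profile, "MKLV"))  # resembles the alignment: positive score
print(score(profile, "GGGG"))  # unrelated: negative score
```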

HMMER: Program package by Sean Eddy (hmmer.janelia.org). Create HMMs from a Multiple Sequence Alignment (MSA). Run searches with HMMs, e.g. Pfam. Requires a few Unix commands. Has a webserver for homology searches.
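The "few Unix commands" are essentially a build step and a search step. The sketch below assembles them as argument lists and only runs them if HMMER is installed. The file names are hypothetical, and the helper functions are assumptions for illustration; check `hmmbuild -h` and `hmmsearch -h` for the options of the installed version.

```python
import shutil
import subprocess


def hmmer_commands(msa_file, hmm_file, seq_db, table_out):
    """Build an HMM from an alignment, then search a sequence database with it."""
    return [
        ["hmmbuild", hmm_file, msa_file],
        ["hmmsearch", "--tblout", table_out, hmm_file, seq_db],
    ]


def run_if_available(commands):
    """Run the commands in order; return False if the first program is not on PATH."""
    if shutil.which(commands[0][0]) is None:
        return False
    for command in commands:
        subprocess.run(command, check=True)
    return True


# Hypothetical file names.
commands = hmmer_commands("globins.sto", "globins.hmm", "proteins.fasta", "hits.tbl")
for command in commands:
    print(" ".join(command))
```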

Membrane protein topology prediction: Predict transmembrane helices (TMH) based on hydrophobicity profile. A TMH is normally about 20 aa. Reentrant regions can create mispredictions. Positive-inside rule guides the direction. Often includes homology information. Most methods use HMMs for prediction.
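The hydrophobicity-profile idea can be sketched with a sliding window over the Kyte-Doolittle hydropathy scale. The window of 19 residues and the threshold of 1.6 are commonly cited defaults and are assumptions here; the test sequence is made up. This is not how HMM-based predictors work; it only shows the underlying signal.

```python
# Kyte-Doolittle hydropathy values.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_segments(seq, window=19, threshold=1.6):
    """Merge all windows whose mean hydropathy reaches the threshold.

    Returns 0-based, half-open (start, end) segments.
    """
    segments = []
    for start in range(len(seq) - window + 1):
        mean = sum(KD[aa] for aa in seq[start:start + window]) / window
        if mean >= threshold:
            end = start + window
            if segments and start <= segments[-1][1]:
                segments[-1] = (segments[-1][0], end)  # extend previous segment
            else:
                segments.append((start, end))
    return segments


# Made-up sequence: polar flanks around a 20-residue hydrophobic stretch.
polar = "DKERSNQG" * 3
seq = polar + "LLIVA" * 4 + polar
print(hydrophobic_segments(seq))
```

The reported segment is wider than the true hydrophobic stretch because windows that only partly overlap it can still pass the threshold, one reason simple profiles are less precise than trained models.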

Membrane protein topology prediction: http://topcons.cbr.su.se/

Prediction of signal peptides. A signal peptide is a short (3-60 amino acids long) peptide chain that directs the post-translational transport of a protein. Signal peptides may also be called targeting signals, signal sequences, transit peptides, or localization signals.

Prediction of cleavage site and localization. SignalP: predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. TargetP: predicts the subcellular location of eukaryotic proteins. The location assignment is based on the predicted presence of any of the N-terminal presequences: chloroplast transit peptide (cTP), mitochondrial targeting peptide (mTP) or secretory pathway signal peptide (SP). The methods combine predictions from several artificial neural networks and HMMs.

InterPro: Integrates domain/motif predictions from several databases. ProDom: sequence clusters built from UniProtKB using PSI-BLAST. PROSITE patterns: simple regular expressions. PROSITE and HAMAP profiles: sequence matrices. PRINTS: fingerprints, unweighted Position Specific Sequence Matrices (PSSMs). PANTHER, PIRSF, Pfam, SMART, TIGRFAMs, Gene3D and SUPERFAMILY: hidden Markov models (HMMs). TMHMM, SignalP.
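Because PROSITE patterns are simple regular expressions, they can be translated almost mechanically. The sketch below converts the basic PROSITE syntax (`x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats) into a Python regex. The converter is an illustration covering only these elements, and the pattern `N-{P}-[ST]-{P}` and the test sequence are used as examples.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # forbidden residues
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                     # any residue
        element = element.replace("<", "^").replace(">", "$")   # sequence ends
        parts.append(element)
    return "".join(parts)


def find_matches(pattern, seq):
    """0-based start positions of non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [m.start() for m in regex.finditer(seq)]


print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_matches("N-{P}-[ST]-{P}", "MKNGSAANPTL"))
```

Short patterns like this match by chance very often, which is why pattern hits need supporting evidence.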

Protein interaction networks: Yeast PPI (http://www.bordalierinstitute.com/). Connectivity/degree = number of interaction partners. Hubs = highly connected proteins. Scale-free topology: most genes have low connectivity, few have high. Mainly yeast two-hybrid and tandem affinity purifications.
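Degree and hubs follow directly from a list of pairwise interactions. A minimal sketch with a made-up edge list; the protein names and the hub cutoff are assumptions, since there is no universal degree threshold for calling a protein a hub.

```python
from collections import defaultdict


def degrees(edges):
    """Number of distinct interaction partners per protein (self-interactions ignored)."""
    partners = defaultdict(set)
    for a, b in edges:
        if a != b:
            partners[a].add(b)
            partners[b].add(a)
    return {protein: len(found) for protein, found in partners.items()}


def hubs(degree_by_protein, min_degree):
    """Proteins with at least min_degree partners, sorted by name."""
    return sorted(p for p, d in degree_by_protein.items() if d >= min_degree)


# Made-up interactions.
edges = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
    ("P1", "P5"), ("P2", "P3"), ("P5", "P6"),
]
degree_by_protein = degrees(edges)
print(degree_by_protein)
print(hubs(degree_by_protein, min_degree=4))
```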

Protein networks. Protein-protein interactions (PPI): IntAct (www.ebi.ac.uk/intact), DIP (dip.doe-mbi.ucla.edu/). Pathways: KEGG (www.genome.jp/kegg/), Biocarta (www.biocarta.com), Reactome (www.reactome.org/). Regulatory networks.

Predicting the function of a gene. Expression pattern: from RNAseq or microarrays; same expression pattern -> involved in the same pathway or has similar function. Homology: interactions are often conserved between species.
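"Same expression pattern" is usually quantified with a correlation coefficient across samples. The sketch below computes Pearson correlation between made-up expression profiles; the gene names and values are invented, and Pearson is one common choice among several.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Made-up expression values over five samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.8],
    "geneC": [5.0, 3.0, 4.0, 1.0, 2.0],
}

r_ab = pearson(expression["geneA"], expression["geneB"])
r_ac = pearson(expression["geneA"], expression["geneC"])
print(f"geneA vs geneB: {r_ab:.3f}")  # strongly co-expressed
print(f"geneA vs geneC: {r_ac:.3f}")  # anti-correlated
```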

Predicting the function of a gene. Phylogenetic profiles: same conservation pattern -> involved in same pathways.
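A phylogenetic profile is a presence/absence vector over genomes, and two profiles can be compared with a simple similarity measure. The sketch below uses the Jaccard index on made-up profiles; the genes, genomes and the choice of Jaccard are assumptions for illustration.

```python
def jaccard(a, b):
    """Shared presences divided by genomes where at least one gene is present."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


# Made-up profiles: 1 = ortholog present in that genome, 0 = absent.
profiles = {
    "geneA": (1, 1, 0, 1, 0, 1),
    "geneB": (1, 1, 0, 1, 0, 1),
    "geneC": (0, 0, 1, 0, 1, 0),
    "geneD": (1, 1, 0, 0, 0, 1),
}

for other in ("geneB", "geneC", "geneD"):
    print(other, jaccard(profiles["geneA"], profiles[other]))
```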

Predicting the function of a gene. Gene fusions (Rosetta stone theory): fusion of two proteins that interact can be convenient since their expression can be co-regulated.

Predicting the function of a gene. Genomic context: adjacent genes may be regulated together (operons in bacteria).

Predicting the function of a gene. Automated literature mining: genes that are often found in the same abstracts are more likely to interact or have related functions. Genetic interaction: genes with similar knock-down phenotypes or rescuing phenotypes are likely to have similar functions.

STRING (http://string.embl.de): A database that combines different predictions of functional links. Includes experimental databases such as DIP, BIND, KEGG, Biocarta etc. Bayesian statistics with weighting of the different data sources and validation against known interactions.
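One simple way to combine several independent lines of evidence into a single confidence is to compute the probability that at least one of them is right. The sketch below shows that rule under the assumption that the sources are independent and each score is a probability. It is an illustration of the general idea, not the exact scoring scheme of any database, and it leaves out the weighting and validation steps mentioned above.

```python
def combine_scores(scores):
    """Probability that at least one independent evidence source is correct."""
    p_all_wrong = 1.0
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score must be between 0 and 1, got {s}")
        p_all_wrong *= 1.0 - s
    return 1.0 - p_all_wrong


# Made-up per-source confidences for one protein pair.
evidence = {"coexpression": 0.5, "textmining": 0.5, "experiments": 0.6}
print(combine_scores(evidence.values()))
```

Several weak sources can add up to a confident link, while a single strong source is never weakened by adding more evidence.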


Gene ontology (GO): Functional classification of genes/proteins. All different databases and annotators use different definitions of the same function => creates a problem in bioinformatics. The GO Consortium gives standardized annotations to genes and proteins, with relationships between terms.

Gene ontology (GO): Divided into three categories (examples for cytochrome c): Molecular function (electron transporter activity), Cellular component (mitochondrial matrix), Biological process (oxidative phosphorylation). Uses Directed Acyclic Graphs (DAGs). Evidence codes, e.g. TAS (Traceable author statement) or IEP (Inferred from expression pattern).
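In a DAG a term can have more than one parent, so an annotation to a specific term implies annotation to every ancestor along all paths. The sketch below collects the ancestors of a term in a small made-up graph. The term names and relationships are simplified for illustration and are not taken from the actual ontology.

```python
# Hypothetical mini-ontology: term -> list of parent terms.
PARENTS = {
    "oxidative phosphorylation": ["energy metabolism", "phosphorylation"],
    "energy metabolism": ["metabolic process"],
    "phosphorylation": ["metabolic process"],
    "metabolic process": ["biological process"],
    "biological process": [],
}


def ancestors(term):
    """All terms reachable by following parent links, each reported once."""
    seen = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


print(sorted(ancestors("oxidative phosphorylation")))
```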

Never trust a server blindly! Always do control experiments. Positive controls: submit sequences for which you know the right answer. Negative controls: random or shuffled sequences. Try several different methods and use the consensus.
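A shuffled negative control keeps the amino acid composition and length of the query but destroys the order, so any hit it produces reflects composition rather than homology. A minimal sketch follows; the function name, the seed parameter and the example sequence are assumptions for illustration.

```python
import random


def shuffled_control(seq, seed=None):
    """Return the residues of seq in random order (same composition and length)."""
    rng = random.Random(seed)  # pass a seed to make the control reproducible
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


# Made-up query sequence.
query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
control = shuffled_control(query, seed=1)
print(query)
print(control)
```

Submitting a handful of shuffled versions, not just one, gives a better feel for how often a method reports a hit by chance.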