Homology, Information Gathering, and Domain Annotation for Proteins
Outline
WHAT IS HOMOLOGY?
HOW TO GATHER KNOWN PROTEIN INFORMATION?
HOW TO ANNOTATE PROTEIN DOMAINS?
EXAMPLES AND EXERCISES
Homology, or: Why knowledge transfer between organisms works
The concept of homology
"The same organ in different animals under every variety of form and function." (Richard Owen, 1843)
Figure: Homologous forelimbs (http://bytesizebio.net/wp-content/uploads/2009/07/homology-limbs)
Homology: alikeness because of common ancestry
HOMOLOGY: The relationship of any two characters that have descended with divergence from a common ancestral character
ANALOGY: The relationship of any two characters that have descended convergently from unrelated ancestors
Figure labels: pterosaur, bat, bird (http://upload.wikimedia.org/wikipedia/commons/3/38/homology.jpg)
CHARACTERS: Exist on very different levels of biological organization, e.g. entire organs, genes, domains, single nucleotides
Homology is a qualitative concept (all-or-none): two characters are either homologous or they are not
Source: Steven M. Carr, 2009, http://www.mun.ca/biology/scarr/molecular_homology_&_analogy.html
Three homology subtypes
ORTHOLOGY: Two homologous characters separated by a speciation event
PARALOGY: Two homologous characters arising from a duplication event
XENOLOGY: Two homologous characters whose history involves inter-species (horizontal) transfer of genetic material
Figure labels: Speciation, Duplication, Horizontal transfer
Source: Walter M. Fitch, Trends in Genetics, 2000
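The three definitions above can be sketched as a lookup on a small gene tree. This is a minimal illustration, not a phylogenetic method: the tree, the gene names, and the event labels are all hypothetical, and it assumes the events at the internal nodes are already known. Two genes are called xenologs if any node on the path between them is a horizontal transfer; otherwise the event at their last common ancestor decides.

```python
# Hypothetical gene tree: child -> parent. All names are made up for illustration.
PARENT = {
    "geneA_human": "specA", "geneA_mouse": "specA",
    "geneB_human": "specB", "geneB_mouse": "specB",
    "specA": "dup", "specB": "dup",
    "dup": "hgt", "geneX_bacterium": "hgt",
}

# Assumed event label for every internal node of the tree.
EVENT = {
    "specA": "speciation",
    "specB": "speciation",
    "dup": "duplication",
    "hgt": "horizontal transfer",
}


def ancestors(node):
    """Return the internal nodes from the node's parent up to the root."""
    path = []
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def homology_subtype(gene_1, gene_2):
    """Classify two homologous genes as orthologs, paralogs, or xenologs."""
    path_1 = ancestors(gene_1)
    path_2 = ancestors(gene_2)
    shared = set(path_2)
    # The first shared node when walking up from gene_1 is the last common ancestor.
    lca = next(node for node in path_1 if node in shared)
    # Internal nodes on the path connecting the two genes, including the ancestor.
    between = path_1[: path_1.index(lca) + 1] + path_2[: path_2.index(lca)]
    if any(EVENT[node] == "horizontal transfer" for node in between):
        return "xenologs"
    return {"speciation": "orthologs", "duplication": "paralogs"}[EVENT[lca]]


print(homology_subtype("geneA_human", "geneA_mouse"))
print(homology_subtype("geneA_human", "geneB_human"))
print(homology_subtype("geneA_human", "geneX_bacterium"))
```

In this toy tree the human and mouse copies of geneA are separated by a speciation, geneA and geneB by a duplication, and the bacterial gene is connected to the others only through the transfer node.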
Protein domains as evolutionary modules and homology units
PROTEIN DOMAIN: A polypeptide chain capable of autonomous folding
Many proteins comprise multiple domains
Many domains are found in different contexts (domain shuffling)
Most classification schemes build on domains, not on entire proteins
Source: Söding & Lupas, Bioessays, 2003
Homology assessment in proteins is similarity-based
SEQUENCE SIMILARITY: The primary marker of homology, because sequences change constantly
STRUCTURAL SIMILARITY: Similar structures are more likely to originate by convergence, due to the relatively small size of protein structure space. But structure diverges more slowly and thus helps to recognize more distant relationships
FUNCTIONAL RESIDUES: Residues within active sites are often the most highly conserved positions in a protein sequence
Order of evidence: 1. Sequence 2. Structure 3. Function
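As a rough illustration of sequence similarity, percent identity can be computed from two sequences that are already aligned. This is a minimal sketch under stated assumptions: both sequences are given as equal-length strings with "-" for gaps, and identity is counted only over columns where neither sequence has a gap (other definitions use a different denominator). The function name and the toy sequences are hypothetical; real homology searches use tools such as BLAST, which also assess statistical significance.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of identical residues over the gap-free columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == "-" or res_b == "-":
            continue  # skip columns that contain a gap in either sequence
        compared += 1
        if res_a.upper() == res_b.upper():
            matches += 1
    if compared == 0:
        return 0.0  # nothing to compare
    return 100.0 * matches / compared


# Toy alignment, made up for illustration: 4 of 5 gap-free columns are identical.
print(percent_identity("MKV-LA", "MRVQLA"))
```

A high identity over a long alignment suggests homology, but identity alone is not a test of it; the value says nothing about whether the similarity could have arisen by chance.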
Homology: what's in it for me?
Works across species borders
  The rationale behind using model organisms
Transfer knowledge between proteins
  A good starting point before any experiment
Improved experimental results
  E.g. improve thermostability by using a homolog from a thermophilic organism
Protein databases and analysis servers, or: How to exploit existing knowledge
Current knowledge on proteins in online databases
Offered by different research institutions
Free information retrieval for academic purposes
From broad all-around databases (e.g. UniProt and NCBI) to databases specialized in particular aspects (e.g. hierarchical structural classification)
The National Center for Biotechnology Information (NCBI)
The NCBI advances science and health by providing access to biomedical and genomic information.
www.ncbi.nlm.nih.gov
Numerous popular resources:
  PubMed (life science literature)
  Sequences (whole genomes to individual proteins)
  Gene expression data
  Taxonomy
Numerous tools, most importantly BLAST for homology detection
A good starting point for an analysis
The Universal Protein Resource (UniProt)
The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.
www.uniprot.org
UniProtKB (knowledge base):
  Swiss-Prot: manually annotated and reviewed
  TrEMBL: automatically annotated and not reviewed
Classifications order proteins to tame their tremendous diversity
SEQUENCE-BASED: grouping is based on homology inferred from detectable sequence similarity
  Pfam: comprehensive database classifying newly found domains into families
  SMART: annotation of known domains in proteins
STRUCTURE-BASED: grouping mixes structural features (i.e. analogy) and homology
  CATH: Class, Architecture, Topology, Homology
  SCOP (Structural Classification Of Proteins): Class, Fold, Superfamily, Family
Example: Annotate domains in LRRK2 (Human)
1. Enter the name of the protein (LRRK2) in UniProt (1) and explore the retrieved information
2. Obtain the sequence in FASTA (2) format from the NCBI (3)
3. Submit the sequence to domain databases like SMART (4) or Pfam (5) and record the identified domains in your log file
1) UniProt: google "uniprot" or www.uniprot.org
2) FASTA: a widely used plain text file format for sequence data
3) NCBI: google "ncbi" or www.ncbi.nlm.nih.gov
4) SMART: google "embl smart" or smart.embl-heidelberg.de
5) Pfam: google "pfam" or pfam.sanger.ac.uk
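Because the example relies on the FASTA format, a small parser may help to see what such a file contains: a header line starting with ">" followed by one or more lines of sequence. This is a minimal sketch; the function name is hypothetical and the record below is a made-up toy sequence, not the LRRK2 entry. For real work, an established library such as Biopython is the safer choice.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-formatted text into a mapping of header -> sequence."""
    records: dict[str, list[str]] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[header].append(line)
    # Sequence lines are wrapped in the file; join them into one string.
    return {name: "".join(chunks) for name, chunks in records.items()}


# Toy record, made up for illustration only.
example = """>example_protein hypothetical toy record
MKTAYIAK
QRQISFVK
"""

for name, sequence in parse_fasta(example).items():
    print(name, len(sequence), sequence)
```

The joined sequence string is what gets pasted into the search form of a domain database.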
Exercise: Annotate domains in NarX (E. coli)
1.
1) UniProt: google "uniprot" or www.uniprot.org
2) FASTA: a widely used plain text file format for sequence data
3) NCBI: google "ncbi" or www.ncbi.nlm.nih.gov
4) SMART: google "embl smart" or smart.embl-heidelberg.de
5) Pfam: google "pfam" or pfam.sanger.ac.uk