Homology, Information Gathering, and Domain Annotation for Proteins

Outline
- Homology
- Information Gathering for Proteins
- Domain Annotation for Proteins
- Examples and exercises

The concept of homology
"The same organ in different animals under every variety of form and function." (Richard Owen, 1843)
Figure: Homologous forelimbs (http://bytesizebio.net/wp-content/uploads/2009/07/homology-limbs)

Homology
Similarity due to common ancestry
- Homology: the relationship of any two characters that have descended, with divergence, from a common ancestral character (common ancestry)
- Analogy: the relationship of any two characters that have descended convergently from unrelated ancestors (convergent evolution)
- Characters exist at very different levels of biological organization, ranging from entire organs through genes and domains down to single nucleotides
- Homology is a qualitative concept (all-or-none): two characters are either homologous or they are not
- Homology is not precisely defined
Figure: pterosaur, bat, bird (http://upload.wikimedia.org/wikipedia/commons/3/38/homology.jpg)
Source: Steven M. Carr, 2009, http://www.mun.ca/biology/scarr/molecular_homology_&_analogy.html

Subtypes of homology
Three disjoint subtypes:
- Orthology: two homologous characters separated by a speciation event
- Paralogy: two homologous characters arising from a duplication event
- Xenology: two homologous characters whose history involves interspecies (horizontal) transfer of genetic material
Figure labels: horizontal transfer, speciation, duplication
Walter M. Fitch, Trends in Genetics, 2000
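
The three subtypes above can be sketched as a small decision rule on a gene tree. This is only an illustration under stated assumptions: the `Node` class, the `classify` function, and the event labels "speciation", "duplication" and "transfer" are hypothetical names, and treating horizontal transfer as a labelled tree node is a simplification. Two distinct leaves are called xenologs if any node on the path connecting them is a transfer; otherwise the event at their last common ancestor decides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)  # identity-based equality, so nodes can be stored in sets
class Node:
    name: str
    event: Optional[str] = None      # internal nodes only; leaves keep None
    parent: Optional["Node"] = None


def path_to_root(node):
    """Return the list [node, parent, grandparent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def classify(a, b):
    """Classify two distinct leaves as orthologs, paralogs or xenologs."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors_a = set(path_a)
    # The first node on b's path that is also an ancestor of a is the LCA.
    lca = next(n for n in path_b if n in ancestors_a)
    # Internal nodes on the path connecting a and b, LCA included.
    connecting = (path_a[1:path_a.index(lca)]
                  + path_b[1:path_b.index(lca)]
                  + [lca])
    if any(n.event == "transfer" for n in connecting):
        return "xenologs"
    return {"speciation": "orthologs", "duplication": "paralogs"}[lca.event]


# Toy tree with placeholder names: a speciation at the root, followed by
# a duplication in one of the two lineages.
root = Node("root", "speciation")
dup = Node("dup", "duplication", parent=root)
gene_a1 = Node("species1_geneA", parent=dup)
gene_a2 = Node("species1_geneB", parent=dup)
gene_b = Node("species2_gene", parent=root)

print(classify(gene_a1, gene_a2))  # LCA is the duplication node
print(classify(gene_a1, gene_b))   # LCA is the speciation node
```

Real orthology inference first has to reconcile a gene tree with a species tree to obtain the event labels; that step is assumed to be done here.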

The protein domain is a basic evolutionary module and an important unit of homology
- Definition: a polypeptide chain capable of autonomous folding
- Many proteins are multi-domain proteins
- Many domains are found in different contexts (domain shuffling)
- Exons in eukaryotic genomes often correspond to domains
- Therefore, protein classification schemes build on domains, not on entire proteins
Söding & Lupas, BioEssays, 2003
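
The idea that one domain can occur in different contexts can be made concrete with a small data structure. The sketch below is only an illustration: the protein names, the domain names "A", "B", "C" and the function `domain_contexts` are hypothetical placeholders, not real annotations. Each protein is an ordered list of domains, and a domain's context is taken to be its immediate left and right neighbours.

```python
from collections import defaultdict

# Hypothetical domain architectures, listed from N- to C-terminus.
architectures = {
    "protein_1": ["A", "B"],
    "protein_2": ["B", "C", "B"],
    "protein_3": ["C"],
}


def domain_contexts(archs):
    """Map each domain to the set of (left, right) neighbour pairs it occurs in."""
    contexts = defaultdict(set)
    for domains in archs.values():
        for i, domain in enumerate(domains):
            left = domains[i - 1] if i > 0 else "N-term"
            right = domains[i + 1] if i + 1 < len(domains) else "C-term"
            contexts[domain].add((left, right))
    return dict(contexts)


contexts = domain_contexts(architectures)
# Domains seen in more than one context are candidates for domain shuffling.
shuffled = sorted(d for d, seen in contexts.items() if len(seen) > 1)
print(shuffled)
```

Working at the level of domains rather than whole proteins is what lets "B" in protein_1 and "B" in protein_2 be recognised as the same unit even though the proteins differ.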

Assessment of homology in proteins
- Homology is assessed by comparing sequence, structure, and function
- Sequence similarity is the primary marker of homology
- Because protein structure space is relatively small, similar structures are more likely than similar sequences to originate by convergence
- However, structure diverges more slowly than sequence and therefore allows more distant relationships to be recognized
- Functional residues within an active site are often the most highly conserved positions in a protein sequence
Figure labels: Sequence, Structure, Function
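
Sequence similarity is usually quantified on an alignment. As a minimal sketch, the function below computes percent identity for two sequences that are assumed to be already aligned and of equal length. The name `percent_identity`, the gap character "-", and the toy sequences are assumptions for illustration. Several conventions exist for the denominator; this one counts only columns without gaps.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over alignment columns without gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy aligned sequences (placeholders, not real proteins):
# 7 gap-free columns, 6 of them identical.
identity = percent_identity("MKT-AYIA", "MKTLAYLA")
print(round(identity, 1))
```

High identity over a long alignment supports homology but does not prove it, and distant homologs can fall below what simple identity can detect, which is where structure comparison becomes useful.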

Information gathering and domain annotation for proteins
- Databases and servers
- Domain annotation

A variety of databases enable information gathering about your protein of interest
- Run by different research institutions
- Allow free information retrieval for academic purposes
- The spectrum ranges from broad, general-purpose databases (UniProt or NCBI) to databases that specialize in particular aspects (e.g. hierarchical structural classification)

The National Center for Biotechnology Information (NCBI) at the National Institutes of Health in the US
- The NCBI advances science and health by providing access to biomedical and genomic information
- Contains numerous popular resources:
  - PubMed (life science literature)
  - Sequences (whole genomes to individual proteins)
  - Gene expression data
  - Taxonomy
  - Numerous tools, most importantly BLAST for homology detection
- A good starting point for an analysis

Protein classifications generate order among the tremendous diversity of proteins
- Sequence-based domain classifications (grouping is based on homology inferred from detectable sequence similarity):
  - SMART: emphasizes signaling domains, fast
  - Pfam: a comprehensive database for classifying newly found domains into domain families
- Structure-based classification schemes:
  - CATH: Class, Architecture, Topology, Homology
  - SCOP (Structural Classification of Proteins): Class, Fold, Superfamily, Family
- Homology is not a criterion on all levels of classification
- In contrast to cellular life, proteins are polyphyletic

Example 1: Annotate domains in LRRK2 (Human)
- Obtain the sequence in FASTA [1] format from the NCBI [2]
- Enter the name of the protein (LRRK2) in UniProt [3] and see all the information one can retrieve there
- Put the sequence into domain databases like SMART [4] or Pfam [5] and mark the identified domains in your log file
[1] FASTA: a widely used plain text file format for sequence data
[2] NCBI: google "ncbi" or http://www.ncbi.nlm.nih.gov/
[3] UniProt: google "uniprot" or http://www.uniprot.org/
[4] SMART: google "embl smart" or http://smart.embl-heidelberg.de/
[5] Pfam: google "pfam" or http://pfam.sanger.ac.uk/
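
The FASTA and log-file steps of Example 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed workflow: `parse_fasta` and `log_domains` are hypothetical helper names, the record below is a toy placeholder rather than the LRRK2 sequence, and the domain hits are invented coordinates standing in for whatever SMART or Pfam report. Coordinates are 1-based and inclusive, as domain databases usually display them.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping header (without '>') to sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def log_domains(header, sequence, hits):
    """Format domain hits (name, start, end) as log lines, sorted by start."""
    lines = [f"{header} (length {len(sequence)})"]
    for name, start, end in sorted(hits, key=lambda hit: hit[1]):
        # Convert 1-based inclusive coordinates to a Python slice.
        lines.append(f"  {name}\t{start}-{end}\t{sequence[start - 1:end]}")
    return "\n".join(lines)


# Toy FASTA record (placeholder, NOT a real protein sequence).
example = """>toy_protein placeholder record
MKTAYIAKQR
QISFVKSHFS
"""

records = parse_fasta(example)
header, seq = next(iter(records.items()))
# Hypothetical hits; in practice, copy them from the SMART or Pfam results.
hits = [("domain_Y", 12, 18), ("domain_X", 2, 6)]
log = log_domains(header, seq, hits)
print(log)
```

Keeping the residues of each hit next to its coordinates makes it easy to check later that hits from different databases refer to the same region.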

Example 2: Annotate domains in NarX (E. coli)
The format and resources are the same as those listed for Example 1 (FASTA, NCBI, UniProt, SMART, Pfam).