Large-Scale Genomic Surveys


1 Bioinformatics Subtopics
Fold Recognition; Secondary Structure Prediction; Docking & Drug Design; Protein Geometry; Structural Informatics; Homology Modeling; Sequence Alignment; Structure Classification; Gene Prediction; Function Classification; Database Design; Genome Annotation; E-literature; Expression Clustering; Large-Scale Genomic Surveys
(c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu

2 Some Specific Informatics Tools of Bioinformatics
Databases
NCBI GenBank - Protein and DNA sequence
NCBI Human Map - Human Genome Viewer
Ensembl - Genome browsers for human, mouse, zebrafish, mosquito
TIGR - The Institute for Genomic Research
SwissProt - Protein Sequence and Function
ProDom - Protein Domains
Pfam - Protein domain families
ProSite - Protein Sequence Motifs
Protein Data Bank (PDB) - Coordinates for Protein 3D structures
SCOP Database - Domain structures organized into evolutionary families
HSSP - Domain database using Dali
FlyBase
WormBase
PubMed / MedLine
Sequence Alignment Tools
BLAST; Clustal MSAs; FASTA; PSI-Blast; Hidden Markov Models
3D Structure Alignments / Classifications
Dali; VAST; PRISM; CATH; SCOP

3 Some Specific Informatics Tools of Bioinformatics
Databases
NCBI GenBank - Protein and DNA sequence
NCBI Human Map - Human Genome Viewer
Ensembl - Genome browsers for human, mouse, zebrafish, mosquito
TIGR - The Institute for Genomic Research
SwissProt - Protein Sequence and Function
ProDom - Protein Domains
Pfam - Protein domain families
ProSite - Protein Sequence Motifs
Protein Data Bank (PDB) - Coordinates for Protein 3D structures
SCOP Database - Domain structures organized into evolutionary families
HSSP - Domain database using Dali
CATH Database
FlyBase
WormBase
PubMed / MedLine
Sequence Alignment Tools
BLAST; Clustal MSAs; FASTA; PSI-Blast; Hidden Markov Models
3D Structure Alignments / Classifications
Dali; CATH; SCOP; VAST; PRISM

4 Dynamic Programming Algorithm: Alternate Tracebacks Correspond to Alternative Alignments
A B C - N Y R Q C L C R - P M
A Y C Y N - R - C K C R B P
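The point of this slide can be illustrated with a minimal sketch of global (Needleman-Wunsch) alignment that follows every tied traceback branch instead of only one. The scoring values (match +1, mismatch -1, gap -1) and the function name `all_optimal_alignments` are assumptions chosen for illustration; the slide does not state the scoring scheme used in the lecture.

```python
def all_optimal_alignments(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment that returns the best score and every co-optimal alignment.

    The scoring parameters are illustrative assumptions, not the lecture's values.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a

    alignments = []

    def trace(i, j, top, bottom):
        # Walk back from (i, j); every predecessor that reproduces the
        # cell's score is a valid branch, so ties yield several alignments.
        if i == 0 and j == 0:
            alignments.append((top[::-1], bottom[::-1]))
            return
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                trace(i - 1, j - 1, top + a[i - 1], bottom + b[j - 1])
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            trace(i - 1, j, top + a[i - 1], bottom + "-")
        if j > 0 and score[i][j] == score[i][j - 1] + gap:
            trace(i, j - 1, top + "-", bottom + b[j - 1])

    trace(n, m, "", "")
    return score[n][m], alignments


if __name__ == "__main__":
    # The two sequences shown on the slide, with gaps removed.
    best, found = all_optimal_alignments("ABCNYRQCLCRPM", "AYCYNRCKCRBP")
    print("best score:", best, "co-optimal alignments:", len(found))
    for top, bottom in found[:3]:
        print(top)
        print(bottom)
        print()
```

A tiny case shows the effect: aligning "AA" with "A" under these scores gives two equally good alignments, AA over A- and AA over -A, one per traceback path.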

5 Sequence Similarity May Miss Functional Homologies Which Can Be Detected by 3D Structural Analysis
[Figure: % sequence identity versus residues aligned, separating a region of homologous 3D structure from a region of non-homologous 3D structure, with the twilight zone between them. Adapted from Chris Sander.]

6 Structural Validation of Homology
Adenylate Kinase and Guanylate Kinase: 19% sequence identity, Z = 12.2

7 [Figure panels: CspA; Asp tRNA Synthetase; Staphylococcal Nuclease; CspB; Gene 5 ssDNA Binding Protein; Topoisomerase I]

8 Protein Domains
Independent folding units, 50-350 residues
Mean size: [value missing in source]
Alpha folds; Beta folds; Alpha+Beta folds; Alpha/Beta folds

9 COG 272, BRCT family (P. Bork et al.)

10 Cadherin Proteins in Caenorhabditis elegans and Drosophila melanogaster
[Figure: domain architectures of the cadherin proteins found in each genome. Legend: Signal peptide; Cadherin; EGF; EGF_CA; Laminin G; Transmembrane Helix; 7 Pass Transmembrane Domain; HormR; GPS; Merge Position; Classic cytoplasmic domain; Tyrosine Kinase cytoplasmic domain; Type 1 Cytoplasmic domain; Type 2 Cytoplasmic domain; Other Cytoplasmic domain. Courtesy of C. Chothia.]

11 Principal Protein Fold Classes
All alpha; All beta; alpha + beta; alpha / beta

12 Classification of Protein Folds: SCOP, CATH, DALI / FSSP

13 Most proteins in biology have been produced by the duplication, divergence and recombination of the members of a small number of protein families. courtesy of C. Chothia

14 Domain Combinations in Genome Sequences
Average Domain Size: 170 residues
In bacteria, close to 1/3 of proteins consist of one domain and 2/3 consist of two or more domains.
In eukaryotes, close to 1/4 of proteins consist of one domain and 3/4 consist of two or more domains.
Courtesy of C. Chothia

15 SCOP - Protein Fold Hierarchy
Manually Curated Database of Domain Structures
Class - 5
Fold - ~500
Superfamily - ~700
Family - ~1000
Family - domains with common evolutionary origin
Homology: derived by evolutionary divergence

16 Five Principal Fold Classes
All α folds; All β folds; α + β folds; α / β folds; small irregular folds

17 SCOP, the Structural Classification Of Proteins database
This contains all proteins, and protein domains, of known structure, classified in terms of their structure and evolutionary relationships.
SUPERFAMILY
This database contains: (a) hidden Markov models (HMMs) of all the proteins and protein domains in SCOP; (b) a list of the matches made by these HMMs to the sequences of 56 genomes, classified by family.
Courtesy of C. Chothia

18 SUPERFAMILY matches to genome sequences
[Figure: SUPERFAMILY matches per genome; y-axis: Genes, x-axis: Genomes. Courtesy of C. Chothia.]

19 SUPERFAMILY Results for Buchnera and Human Genome Sequences
Quantities compared for Buchnera and Humans: number of sequences; sequences matched by SUPERFAMILY; coverage of genome; number of matched domains; number of families; mean family size; number of large families that form half the matched domains.
Coverage of genome: Buchnera 61%, Humans 41%.
Courtesy of C. Chothia

20 SUPERFAMILY Results for Buchnera and Human Genome Sequences: Top Five Domain Families
Buchnera: P-loop containing nucleotide triphosphate hydrolases; Nucleic acid binding proteins; NAD-binding Rossmann domains; Nucleotidylyl transferases; Class II aaRS synthetases
Humans: Classic zinc fingers; Immunoglobulin superfamily; P-loop containing nucleotide triphosphate hydrolases; EGF/Laminin; Cadherin
Courtesy of C. Chothia

21 Eukaryotes
[Figure: per-genome plot for sc, at, dm, ce and hs. Legend: All: 381 families; at+dm+ce+hs: 56 families; dm+ce+hs: 45 families; Other families. Courtesy of C. Chothia.]

22 Bacteria
[Figure: per-genome plot for bacterial genomes, with legend entries "Total" and "EC +B"; x-axis: Genome. Courtesy of C. Chothia.]

23 CATH Protein Domain Database: Partially Automatic Fold Classification
CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels: Class (C), Architecture (A), Topology (T) and Homologous superfamily (H).
Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., and Thornton, J.M. (1997) CATH - A Hierarchic Classification of Protein Domain Structures. Structure. Vol 5. No 8. p
Pearl, F.M.G., Lee, D., Bray, J.E., Sillitoe, I., Todd, A.E., Harrison, A.P., Thornton, J.M. and Orengo, C.A. (2000) Assigning genomic sequences to CATH. Nucleic Acids Research. Vol 28. No

24 CATH Protein Domain Database Partially Automatic Fold Classification Class, derived from secondary structure content, is assigned for more than 90% of protein structures automatically. Architecture, which describes the gross orientation of secondary structures, independent of connectivities, is currently assigned manually. The topology level clusters structures according to their topological connections and numbers of secondary structures. The homologous superfamilies cluster proteins with highly similar structures and functions. The assignments of structures to topology families and homologous superfamilies are made by sequence and structure comparisons.

25 Representations of Protein Structures
a - full atom
b, c - strands / helices
d - topology diagrams

26 Structural Alignment of Two Globins

27 Structure-based Sequence Alignments
Alignment of Individual Structures; Fusing into a Single Fold Template
Hb VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSAQVKGHGKKVADALTNAV
Mb VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKHGVTVLTALGAIL
Hb AHVD-DMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
Mb KK-KGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Elements: domain definitions; aligned structures, collecting together non-homologous sequences; core annotation
Previous work: Remington, Matthews 80; Taylor, Orengo 89, 94; Artymiuk, Rice, Willett 89; Sali, Blundell 90; Vriend, Sander 91; Russell, Barton 92; Holm, Sander 93; Godzik, Skolnick 94; Gibrat, Madej, Bryant 96; Falicov, F Cohen 96; Feng, Sippl 96; G Cohen 97; Singh & Brutlag
[Figure: multiple alignment of globin sequences grouped under the structures 2hhb, 1ecd and 1mbd, with the structure core and sequence core marked and a consensus profile on the bottom row.]

28 Some Similarities are Readily Apparent, Others are More Subtle
Easy: Globins, 125 res., ~1.5 Å
Tricky: Ig C & V, 85 res., ~3 Å
Very Subtle: G3P-dehydrogenase, C-term. domain, >5 Å

29 Automatically Comparing Protein Structures
Given 2 structures (A & B), there are 2 basic comparison operations:
1 Find an alignment between A and B based on their 3D coordinates
2 Given an alignment, optimally SUPERIMPOSE A onto B: find the best R & T (rotation and translation) to move A onto B
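The second operation, finding the best R and T for a given alignment, can be sketched with the Kabsch least-squares method. This is a swapped-in standard technique, not necessarily the one used in the lecture; the function name `superimpose` and the use of NumPy are assumptions, and the inputs are taken to be already-aligned N x 3 coordinate arrays (row i of A corresponds to row i of B).

```python
import numpy as np


def superimpose(A, B):
    """Least-squares rotation R and translation t moving A onto B (Kabsch method).

    A and B are (N, 3) arrays of equivalent atoms; returns R, t and the RMSD
    after superposition, where a moved point is R @ a + t.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    centroid_a = A.mean(axis=0)
    centroid_b = B.mean(axis=0)
    # 3x3 covariance between the centred coordinate sets
    H = (A - centroid_a).T @ (B - centroid_b)
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (a reflection)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = centroid_b - R @ centroid_a
    moved = A @ R.T + t
    rmsd = float(np.sqrt(((moved - B) ** 2).sum(axis=1).mean()))
    return R, t, rmsd


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(10, 3))          # toy coordinates, not a real protein
    theta = 0.7
    Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
    B = A @ Rz.T + np.array([1.0, -2.0, 3.0])
    R, t, rmsd = superimpose(A, B)
    print("RMSD after superposition:", rmsd)
```

The first operation (finding the alignment) is the harder problem; this sketch assumes it has already been solved.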

30 Distance Matrices Provide a 2D Representation of the 3D Structure
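A minimal sketch of the idea: every pair of residues contributes one entry, the distance between their representative atoms, so an N-residue structure becomes an N x N table. The function names and the toy coordinates below are hypothetical; real input would be Cα coordinates read from a PDB file.

```python
import math


def distance_matrix(coords):
    """Return the N x N matrix of pairwise Euclidean distances."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def contacts(matrix, cutoff):
    """List index pairs (i, j), i < j, whose distance is below the cutoff."""
    n = len(matrix)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if matrix[i][j] < cutoff]


if __name__ == "__main__":
    # Toy coordinates, assumed for illustration only.
    coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
    dm = distance_matrix(coords)
    for row in dm:
        print(["%.1f" % d for d in row])
    # Translating the whole structure leaves the matrix unchanged.
    shifted = [(x + 1.5, y - 2.0, z + 7.0) for x, y, z in coords]
    print(contacts(distance_matrix(shifted), cutoff=6.0))
```

Because the matrix depends only on internal distances, it is the same whatever the position or orientation of the structure, which is what makes it convenient for comparing two proteins without first superimposing them.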

31 Explain Concept of Distance Matrix on Blackboard
N x N distance matrix: antiparallel beta strands, parallel beta strands, helices
N-dimensional space
Metric matrix: $M_{ij} = (D_{i0}^2 + D_{j0}^2 - D_{ij}^2)/2$, where $D_{i0}$ is the distance from point i to the centroid
Top M eigenvectors (M = 3 for a 3D structure)

32 DALI: Protein Structure Comparison by Alignment of Distance Matrices
L. Holm and C. Sander, J. Mol. Biol. 233: 123 (1993)
Generate a Cα-Cα distance matrix for each protein, A and B.
Decompose each matrix into elementary contact patterns, e.g. hexapeptide-hexapeptide submatrices.
Systematically compare all elementary contact patterns in the 2 distance matrices; similar contact patterns are stored in a pair list.
Assemble pairs of contact patterns into larger consistent sets of pairs (alignments), maximizing the similarity score between these local structures.
A Monte Carlo algorithm is used to deal with the combinatorial complexity of building up alignments from contact patterns.
Dali Z score: the number of standard deviations away from the mean pairwise similarity value.
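The Z score definition on this slide can be written out directly. The sketch below is only the textbook formula applied to a list of background scores; the function name `z_score` and the example numbers are assumptions, and it does not reproduce how the Dali program itself builds or normalises its background distribution.

```python
import statistics


def z_score(raw_score, background_scores):
    """Number of standard deviations by which raw_score exceeds the background mean."""
    mean = statistics.mean(background_scores)
    spread = statistics.pstdev(background_scores)  # population standard deviation
    if spread == 0:
        raise ValueError("background scores have zero variance")
    return (raw_score - mean) / spread


if __name__ == "__main__":
    # Made-up similarity scores for unrelated structure pairs.
    background = [2, 4, 4, 4, 5, 5, 7, 9]
    print(z_score(12, background))
```

A high Z score means the pair is far more similar than a typical pair, which is how a value such as the Z = 12.2 quoted for adenylate kinase and guanylate kinase earlier in the slides should be read.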

35 Dali Domain Dictionary
Dietmann, Park, Notredame, Heger, Lappe, and Holm, Nucleic Acids Res. 29: 55-57 (2001)
Dali Domain Dictionary is a numerical taxonomy of all known domain structures in the PDB.
Evolves from the Dali / FSSP Database: Holm & Sander, Nucl. Acid Res. 25: (1997)
Dali Domain Dictionary (Sept): ,532 PDB entries; 17,101 protein chains; 5 supersecondary structure motifs (attractors); 1375 fold types; 2582 functional families; 3724 domain sequence families

36 Explain Concept of Distance Matrix on Blackboard
N x N distance matrix
N-dimensional space
Metric matrix: $M_{ij} = (D_{i0}^2 + D_{j0}^2 - D_{ij}^2)/2$
Eigenvectors of the metric matrix
Principal component analysis

37 A Global Representation of Protein Fold Space
Hou, Sims, Zhang, Kim, PNAS 100: (2003)
Database of 498 SCOP Folds or Superfamilies
The overall pair-wise comparisons of 498 folds lead to a 498 x 498 matrix of similarity scores $S_{ij}$, where $S_{ij}$ is the alignment score between the ith and jth folds. An appropriate method for handling such data matrices as a whole is metric matrix distance geometry. We first convert the similarity score matrix $[S_{ij}]$ to a distance matrix $[D_{ij}]$ by using $D_{ij} = S_{max} - S_{ij}$, where $S_{max}$ is the maximum similarity score among all pairs of folds. We then transform the distance matrix to a metric (or Gram) matrix $[M_{ij}]$ by using $M_{ij} = (D_{i0}^2 + D_{j0}^2 - D_{ij}^2)/2$, where $D_{i0}$ is the distance between the ith fold and the geometric centroid of all N = 498 folds. The eigenvectors of the metric matrix define an orthogonal system of axes, called factors. These axes pass through the geometric centroid of the points representing all observed folds and are ordered by eigenvalue, that is, by decreasing amount of information each factor represents.
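The three steps described on this slide (similarity to distance, distance to metric matrix, eigen-decomposition) can be sketched as follows. This is a minimal illustration assuming NumPy and a symmetric similarity matrix; the function names are hypothetical, the diagonal of the distance matrix is set to zero as an explicit assumption, and $D_{i0}^2$ is computed from the distances alone using the standard centroid identity rather than anything stated on the slide.

```python
import numpy as np


def metric_matrix(D):
    """Metric (Gram) matrix M_ij = (D_i0^2 + D_j0^2 - D_ij^2) / 2.

    D_i0 is the distance from point i to the centroid, obtained from the
    pairwise distances: D_i0^2 = mean_j(D_ij^2) - sum_jk(D_jk^2) / (2 N^2).
    """
    D2 = np.asarray(D, dtype=float) ** 2
    n = D2.shape[0]
    d0_sq = D2.mean(axis=1) - D2.sum() / (2.0 * n * n)
    return 0.5 * (d0_sq[:, None] + d0_sq[None, :] - D2)


def fold_space_coordinates(similarity, n_axes=3):
    """Map a symmetric similarity matrix to coordinates on the leading factors."""
    S = np.asarray(similarity, dtype=float)
    D = S.max() - S                      # D_ij = S_max - S_ij
    np.fill_diagonal(D, 0.0)             # assumption: zero self-distance
    M = metric_matrix(D)
    eigenvalues, eigenvectors = np.linalg.eigh(M)
    order = np.argsort(eigenvalues)[::-1][:n_axes]   # largest eigenvalues first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Scale each axis by sqrt(eigenvalue); negative values are clipped to zero.
    coords = eigenvectors * np.sqrt(np.clip(eigenvalues, 0.0, None))
    return coords, eigenvalues


if __name__ == "__main__":
    # Toy 4 x 4 similarity matrix, made up for illustration.
    S = np.array([[10.0, 8.0, 3.0, 2.0],
                  [8.0, 10.0, 4.0, 3.0],
                  [3.0, 4.0, 10.0, 7.0],
                  [2.0, 3.0, 7.0, 10.0]])
    coords, eigenvalues = fold_space_coordinates(S, n_axes=2)
    print(coords)
    print(eigenvalues)
```

When the distances come from real points in 3D, only three eigenvalues are non-zero and the coordinates are recovered up to rotation and reflection; for similarity-derived distances the leading few factors give an approximate low-dimensional map.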

38 A Global Representation of Protein Fold Space
Hou, Sims, Zhang, Kim, PNAS 100: (2003)
