Heteropolymer. Mostly in regular secondary structure


C → N: trace how you go around the helix.
C → N: C2 → N6, C1 → N5. What's the pattern? Ci → Ni+?
Moving around the helix, each step is not quite 120°.

Figure caption: The side chains in an α-helix stick out to the side, as viewed in a ribbon diagram shown on the left. On the right is a helical wheel view of an α-helix with the 18 residues shown above. This example shows an amphipathic helix, with polar and non-polar residues on opposite sides. The helical wheel is taken from a Java Applet written by Edward K. O'Neil and Charles M. Grisham (University of Virginia in Charlottesville, Virginia): http://cti.itc.Virginia.EDU/~cmg/Demo/wheel/wheelApp.html

Each side can have different properties. All of the amino acids are on the outside.
Gennis 1f3c 31-50
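The hydrogen-bond pattern traced above (C2 → N6, C1 → N5) and the helical wheel can be sketched numerically. This is a minimal sketch, assuming an ideal α-helix with 3.6 residues per turn, that is, 100° of rotation per residue (the "not quite 120" step). The function names are hypothetical and chosen only for illustration.

```python
# Assumption: ideal alpha-helix, 3.6 residues per turn = 100 degrees per residue.
DEGREES_PER_RESIDUE = 100.0
# Assumption: backbone C=O of residue i pairs with N-H of residue i + 4,
# which matches the C1 -> N5 and C2 -> N6 examples on the slide.
HBOND_OFFSET = 4


def wheel_angle(residue_index: int) -> float:
    """Angle in degrees (0 <= angle < 360) on the helical wheel; residue 1 is at 0."""
    return ((residue_index - 1) * DEGREES_PER_RESIDUE) % 360.0


def hbond_partner(i: int) -> int:
    """Residue whose N-H accepts the hydrogen bond from the C=O of residue i."""
    return i + HBOND_OFFSET


def angular_distance(a: float, b: float) -> float:
    """Smallest separation between two wheel angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def same_face(i: int, j: int, tolerance: float = 60.0) -> bool:
    """True if residues i and j point out of roughly the same side of the helix."""
    return angular_distance(wheel_angle(i), wheel_angle(j)) <= tolerance


if __name__ == "__main__":
    for i in range(1, 8):
        print(i, wheel_angle(i), "C=O bonds to N-H of residue", hbond_partner(i))
```

With these assumptions, residues i, i+3, i+4 and i+7 land on the same face, which is why an amphipathic helix shows a polar/non-polar repeat of roughly every three to four residues. After 18 residues (five full turns) the wheel returns to its starting angle.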

Notice the up-down-up-down pattern. The boxes show amino acids.

Motifs

SCOP: Murzin et al., JMB 1995 (Cyrus Chothia)
CATH: Orengo et al., Structure 1997 (Janet Thornton)
Each starts with domains.

Proteins are made of domains. A domain is a structural and an evolutionary unit. They have 50-200 residues. Domains that belong to the same family or superfamily come from a common ancestor.
  similar sequence - family
  diverged sequence but similar fold and function - superfamily
Chothia and Gough, Biochem J (2009) 419, 15-28

www.sciencemag.org SCIENCE VOL 300 13 JUNE 2003

Figure 3. Glycosyl Hydrolases (A-C)
(A) (1→3)-β-glucanase (Varghese et al., 1994) represents the basic (Trans)glycosidases superfamily (c.1.8). Homologous catalytic domains are found in (B) β-glucuronidase and (C) β-galactosidase.
(B) In β-glucuronidase (Jain et al., 1996), the catalytic domain is 3 (in red) and is joined by two other domains: 1 restricts the binding site, and 2 links 1 to 3.
(C) In β-galactosidase, the first three domains have the same structure as β-glucuronidase (Jacobson et al., 1994). Domain 4 links domain 3 to 5, which contributes to the active site.
Bashton and Chothia, Structure 15: 85-99 (2007)

Dominant mechanisms that produce new proteins are:
  duplication of the genes of old proteins
  divergence of these sequences to produce modified functions
  combination of genes to further modify properties

Some superfamilies have many protein domains (9 take up 20% of the human genome) and others have few. There are 800-1000 superfamilies in animals and 250-700 in bacteria. Many superfamilies are found in all kingdoms of life.
Chothia and Gough, Biochem J (2009) 419, 15-28

Classification: based on structure and sequence

Class (C-level): secondary structure composition and contacts.
The first, most general level of the classification, class, describes the relative content of α helices and β sheets in a similar way to that described by Levitt and Chothia [29], except that we only define three major classes: mainly α, mainly β and α-β. Although the latter class can be subdivided into alternating α/β and α+β, in CATH this information is considered at a lower level describing topology.

Architecture (A-level): description of the gross arrangement of secondary structures, independent of connectivity.
This level distinguishes structures in the same class with different architectures, but does not distinguish between different topologies (connectivities). The architectural groupings can sometimes be rather broad as they describe general features of protein-fold shape, for example, the number of layers in an α-β sandwich. A given architecture will contain structures with diverse connectivities (see Figure 2), which will be distinguished at the next level down (topology). For example, in the α-β class (C = 3), there are two common architectures, each containing a large number of different fold families. One is the barrel-like architecture (A = 20) adopted by, for example, TIM-barrel folds. These have an inner β barrel and an outer layer of α helices (Figure 2). Alternatively, the three-layer α-β sandwich architecture (A = 40) consists of a central β sheet which is covered by a layer of α helices on both sides of the sheet (Figure 2).

Topology (T-level): fold families.
Structures which are grouped at the T-level have the same overall fold, which means that they have a similar number and arrangement of secondary structures and that the connectivity linking their secondary structure elements is the same. In this paper, the words fold and topology have the same meaning. Proteins with the same CAT numbers have the same class, architecture and topology but do not necessarily belong to the same homologous superfamily. Within a given topology level, the structures are similar, but may have diverse functions.

Homologous superfamily (H-level): highly similar structures and functional similarity.
At the H-level, structures are grouped by their high structural similarity and similar functions, which suggest that they may have evolved from a common ancestor, particularly where there are resemblances in core packing or putative active sites. Using the example of the mainly α, non-bundle, globin-like folds: the erythrocruorins, colicins, phycocyanins and domain 1 of diphtheria toxin all have the same CAT number (1.10.340), but are differentiated by their H numbers 10, 20, 30 and 40, respectively (see Figure 3).

Sequence family (S-level): significant sequence similarity and thus a high probability of having similar structure/function.
Members which are clustered at this level (having the same CATHS number) have sequence identities >35% and as such are presumed to have extremely similar structures and functions; they may be slightly different examples of the same protein from different species belonging to the same sequence superfamily.
CATH
  Class: α, β, αβ
  Architecture: gross arrangement of 2° structure, independent of connectivity
  Topology: fold family, linking of 2° structure (Fold = Topology)
  Homologous superfamily: structure similar, function similar
  Sequence family: >35% identity

SCOP
  Class: α, β, α/β, α+β
  Fold: same 2° structure elements, same topology, not necessarily related
  Superfamily: common evolutionary origin, low sequence identity
  Family: >30% identical, or >15% with same function
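The hierarchy can be read directly off a CATH number: each dot-separated field is one level, and two domains are related at the deepest level where their numbers still agree. Here is a minimal sketch, assuming codes written as dot-separated integers in C.A.T.H order; the helper names are hypothetical and this is not the CATH database's own software.

```python
from typing import Optional, Tuple

# Level names in the order they appear in a CATH number (C.A.T.H).
LEVELS = ("class", "architecture", "topology", "homologous superfamily")


def parse_cath(code: str) -> Tuple[int, ...]:
    """Split a code such as '1.10.340.20' into integer fields."""
    return tuple(int(part) for part in code.split("."))


def deepest_shared_level(a: str, b: str) -> Optional[str]:
    """Name of the deepest level at which two CATH codes agree, or None."""
    shared = None
    for name, x, y in zip(LEVELS, parse_cath(a), parse_cath(b)):
        if x != y:
            break
        shared = name
    return shared


if __name__ == "__main__":
    # Same CAT number (1.10.340), different H numbers: same fold only.
    print(deepest_shared_level("1.10.340.10", "1.10.340.20"))
    # Barrel (A = 20) versus three-layer sandwich (A = 40) in class 3.
    print(deepest_shared_level("3.20", "3.40"))
```

Two globin-like domains with H numbers 10 and 20 therefore share a topology (fold) without being placed in the same homologous superfamily.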

Illustration of motif overlaps in the mainly β sandwich architecture. Each structure shown can be related to the central tenascin structure by a motif containing at least four β strands (although these are not sequential in the transthyretin structure) up to seven β strands in plastocyanin and the immunoglobulin variable domain structures. It can be seen that this results in the possible merging of the immunoglobulin fold family (2rhe) and the jelly-roll fold family (1tnfA) through overlap of a large motif containing five β strands. This is not currently done in CATH, as both families are commonly referred to as separate folds in the literature.
Structure 1997, Vol 5 No 8

1TTF.pdb
Residue ranges marked on the figure: 9-14, 17-21, 31-36, 46-51, 56-60, 66-76, 88-94

Record   Sheet  Strands  Start      End        Sense  Registration
SHEET 1  1      3        GLU A 9    THR A 14    0
SHEET 2  1      3        SER A 17   ASP A 23   -1     O SER A 21   N GLU A 9
SHEET 3  1      3        THR A 56   SER A 60   -1     N ALA A 57   O ILE A 20
SHEET 1  2      4        GLN A 46   PRO A 51    0
SHEET 2  2      4        TYR A 31   GLU A 38   -1     N TYR A 36   O GLN A 46
SHEET 3  2      4        VAL A 66   THR A 76   -1     N VAL A 75   O TYR A 31
SHEET 4  2      4        ILE A 88   THR A 94   -1     N ILE A 88   O VAL A 72

Branden and Tooze
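The SHEET records above can be summarized with a few lines of code. This is a minimal sketch that works on values transcribed by hand from the 1TTF listing rather than parsing a real PDB file. It assumes the standard PDB meaning of the sense field (0 for the first strand of a sheet, 1 for parallel, -1 for antiparallel to the previous strand); the function names are hypothetical.

```python
from collections import defaultdict

# (sheet id, first residue, last residue, sense), transcribed from the listing.
# sense: 0 = first strand in the sheet, -1 = antiparallel, 1 = parallel.
STRANDS = [
    ("1", 9, 14, 0),
    ("1", 17, 23, -1),
    ("1", 56, 60, -1),
    ("2", 46, 51, 0),
    ("2", 31, 38, -1),
    ("2", 66, 76, -1),
    ("2", 88, 94, -1),
]


def strand_length(start: int, end: int) -> int:
    """Number of residues in a strand, counting both end residues."""
    return end - start + 1


def lengths_by_sheet(strands):
    """Map each sheet id to the list of its strand lengths, in record order."""
    lengths = defaultdict(list)
    for sheet_id, start, end, _sense in strands:
        lengths[sheet_id].append(strand_length(start, end))
    return dict(lengths)


def is_fully_antiparallel(strands) -> bool:
    """True if every strand after the first in each sheet has sense -1."""
    return all(sense == -1 for _id, _s, _e, sense in strands if sense != 0)


if __name__ == "__main__":
    print(lengths_by_sheet(STRANDS))
    print(is_fully_antiparallel(STRANDS))
```

Every non-initial strand in the listing has sense -1, so both sheets of this β sandwich are built from antiparallel strands.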

SCOP: Structural Classification of Proteins. 1.75 release
38221 PDB Entries (23 Feb 2009). 110800 Domains. 1 Literature Reference
(excluding nucleic acids and theoretical models)

Class                                 Folds   Superfamilies   Families
All alpha proteins                      284             507        871
All beta proteins                       174             354        742
Alpha and beta proteins (a/b)           147             244        803
Alpha and beta proteins (a+b)           376             552       1055
Multi-domain proteins                    66              66         89
Membrane and cell surface proteins       58             110        123
Small proteins                           90             129        219
Total                                  1195            1962       3902

SCOP: Structural Classification of Proteins. 1.37 release
6497 PDB Entries (20 Oct 1997). 13073 Domains. 101 Literature References
(including nucleic acids and theoretical models)

Class                                 Folds   Superfamilies   Families
All alpha proteins                       97             126        178
All beta proteins                        61             112        163
Alpha and beta proteins (a/b)            75             110        188
Alpha and beta proteins (a+b)           101             146        201
Multi-domain proteins                    20              20         25
Membrane and cell surface proteins       10              16         17
Small proteins                           41              58         78
Total                                   405             558        850

http://scop.mrc-lmb.cam.ac.uk/scop/count.html#scop-1.75
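Comparing the two releases shows that the number of known structures grew much faster than the number of folds. This is a minimal sketch using only the totals quoted above for releases 1.37 and 1.75; the dictionary layout and function name are hypothetical.

```python
# Totals copied from the SCOP 1.37 (1997) and 1.75 (2009) summaries above.
SCOP_1_37 = {"pdb_entries": 6497, "domains": 13073, "folds": 405, "families": 850}
SCOP_1_75 = {"pdb_entries": 38221, "domains": 110800, "folds": 1195, "families": 3902}


def growth_factors(old: dict, new: dict) -> dict:
    """Ratio new/old for every quantity present in both releases."""
    return {key: new[key] / old[key] for key in old if key in new}


if __name__ == "__main__":
    for name, factor in growth_factors(SCOP_1_37, SCOP_1_75).items():
        print(f"{name}: grew {factor:.1f}-fold")
```

With these totals, classified domains grew more than eightfold while folds grew about threefold, which suggests that most newly solved structures fall into folds that were already known.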

CATH numbering scheme for representative structures from the globin-like fold family in the mainly α class. Four of the seven levels within the CATH database are shown, associated with Class, Architecture, Topology, and Homology. Each level is associated with a unique number. The (A), (T) and (H) levels are numbered in bins of ten to allow expansion of the database.

Class:         1 Mainly α; 2 Mainly β; 3 α β; 4 Few SS
Architecture:  10 Non-bundle; 20 Bundle; 30 Few SS
Topology:      460 ...; 470 Variant surface glycoprotein; 480 Glucoamylase, domain 2; 490 Globin-like; 500 β-lactamase, domain 2; 510 Casein kinase ...; 520 ...
Homology:      10 1hlm; 20 1cpc chain A; 30 1col chain A; 40 1ddt domain 2

Example: 1.10.490.20 = Mainly α . Non-bundle . Globin-like . 1cpc chain A

Table 1. The numbers of families identified at different levels in the CATH hierarchy are shown for the mainly α, mainly β and α-β classes. Each level gives the number followed by the percentage.

Class       A           T            H            S             N             I             Domains
Mainly α     3   9.7    145  28.7    157  24.3     232  21.7     380  20.9     837  26.4    1793  22.2
Mainly β    17  54.8    102  20.2    137  21.2     266  24.9     585  32.1     891  28.1    2625  32.5
α β         10  32.3    244  48.3    337  52.2     556  52.1     829  45.5    1411  44.5    3562  44.1
Few SS*      1   3.2     14   2.8     14   2.2      14   1.3      27   1.5      32   1.0      98   1.2
Total       31 100.0    505 100.0    645 100.0    1068 100.0    1821 100.0    3171 100.0    8078 100.0

*The number of families for proteins having few secondary structure (SS) elements is also shown at each level in the hierarchy.

2BMF-Bovine ATPase F1
Chain A:
  24-94    all beta
  95-379   P-loop containing nucleoside triphosphate hydrolase
  358-475  Left handed superhelix
http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?ver=1.75&key=1bmf

Vogel et al., Current Opinion in Structural Biology 2004, 14: 208-216
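The percentage columns in Table 1 are each class's share of the column total. This is a minimal sketch that recomputes the architecture (A) column from the counts in the table; the names are hypothetical and ASCII class labels are used in place of the Greek letters.

```python
# Architecture (A-level) counts per class, copied from Table 1.
A_LEVEL_COUNTS = {
    "Mainly alpha": 3,
    "Mainly beta": 17,
    "alpha-beta": 10,
    "Few SS": 1,
}


def percentages(counts: dict) -> dict:
    """Share of the column total for each class, rounded to one decimal place."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


if __name__ == "__main__":
    for name, pct in percentages(A_LEVEL_COUNTS).items():
        print(f"{name}: {pct}%")
```

The recomputed values match the table: the mainly β class accounts for over half of the 31 architectures even though it holds fewer domains than the α β class.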