Supporting online material

Materials and Methods

Target proteins

All predicted ORFs in the E. coli genome (1) were downloaded from the Colibri database (2) (http://genolist.pasteur.fr/colibri/). The 737 proteins longer than 100 residues and with two or more transmembrane helices predicted by TMHMM (3) (v. 2.0) were retained. Of the corresponding genes, 23 contained restriction sites that prevented cloning into any of the three standard phoA or two gfp vectors (4) used, reducing the final collection of target proteins to 714.

Cloning

ORFs encoding the selected membrane proteins were amplified by PCR from the E. coli strain MG1655 (1). Three different combinations of primer-introduced restriction sites and correspondingly digested vectors were used: (i) 5′ XhoI / 3′ KpnI, (ii) 5′ XhoI / 3′ BamHI, (iii) 5′ NdeI / 3′ BamHI. Cloning was performed in E. coli strains MC1061 or TOP10F′. The three phoA fusion vectors contain the gene of interest, followed by a short linker sequence and the phoA coding region. The gfp fusion vectors contain the gene of interest, followed by a linker sequence encoding a TEV protease site, the gfp gene (S65T, F64L + Cycle 3 mutant) and a His8 tag at the 3′ end. All genes are preceded by the same ribosome binding site and the start codon is always ATG. The vectors and enzymes used are described in detail elsewhere (4). All constructs were confirmed by sequencing from both the 5′ and 3′ ends.

Protein expression and experimental determination of C-terminal locations

PhoA and GFP assays were repeated at least three times, and in general four times. Constructs encoding PhoA fusions were transformed into the CC118 strain and assayed as described previously (4). Cell density and PhoA activity were measured using a SpectraMaxPlus384 (Molecular Devices, California). Constructs encoding GFP fusions were transformed into the BL21(DE3)pLysS strain and assayed as described (4, 5) with minor adaptations.
Overnight cultures were back-diluted into 5 ml of LB medium in 24-well growth plates. Cultures were grown at 37 °C to an OD600 of approximately 0.4-0.6, then induced with 0.4 mM IPTG and grown for an additional 2 h. The cell pellet was resuspended in GFP resuspension buffer (50 mM Tris-HCl pH 8.0, 200 mM NaCl, 15 mM EDTA), incubated at room temperature for 2 h, and assayed for GFP fluorescence. Fluorescence was measured with an excitation wavelength of 485 nm, an emission wavelength of 512 nm, and a 495 nm cutoff filter in a SpectraMaxGemini EM (Molecular Devices, California).
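The selection rule under Target proteins and the restriction-site constraint under Cloning can be illustrated with a minimal Python sketch. It is not the procedure actually used: the function names are hypothetical, the TMHMM helix count is assumed to be supplied by the caller, and a gene is assumed to be clonable with a given enzyme pair if neither recognition sequence occurs anywhere in the ORF. The recognition sequences themselves (XhoI CTCGAG, KpnI GGTACC, BamHI GGATCC, NdeI CATATG) are standard and palindromic, so scanning one strand is sufficient.

```python
# Recognition sequences of the four enzymes named in the Cloning section.
SITES = {
    "XhoI": "CTCGAG",
    "KpnI": "GGTACC",
    "BamHI": "GGATCC",
    "NdeI": "CATATG",
}

# The three 5'/3' enzyme combinations listed in the text.
COMBINATIONS = [("XhoI", "KpnI"), ("XhoI", "BamHI"), ("NdeI", "BamHI")]


def is_target(length: int, n_tm_helices: int) -> bool:
    """Selection rule from the text: longer than 100 residues, >= 2 TM helices."""
    return length > 100 and n_tm_helices >= 2


def usable_combinations(orf_dna: str) -> list[tuple[str, str]]:
    """Return the enzyme pairs whose sites are both absent from the ORF.

    Assumption: an internal site for either enzyme rules out that pair.
    """
    seq = orf_dna.upper()
    return [
        (five, three)
        for five, three in COMBINATIONS
        if SITES[five] not in seq and SITES[three] not in seq
    ]
```

Under these assumptions, an ORF for which `usable_combinations` returns an empty list would correspond to a gene that could not be cloned with any of the three site combinations.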
The measured values were normalized to allow a quantitative comparison between the GFP and PhoA measurements. The raw GFP activity value for each fusion was first divided by the cell density (OD600). PhoA and GFP activities were then divided by the median activity of the active PhoA or GFP fusions, respectively (347 units for PhoA, 3924 arbitrary units for GFP).

Cutoff values for C-terminal assignments were determined by the following procedure (see Fig. 1 in the main text). First, all pairs of C-terminally aligned homologues among the 573 proteins for which both PhoA and GFP clones were available were identified. Homologues were defined as proteins with a pairwise BLAST E-value < 10⁻⁴ for which the BLAST alignment reached to within 25 residues of the C-termini, to ensure that no extra C-terminal TMHs are present in one or the other protein. To define the cutoffs, the two 45° lines in Fig. 1 were moved symmetrically towards the main diagonal; for each location of the lines (defined by the intersections (a,0) and (0,a) with the x- and y-axes), all proteins located above the upper cutoff line were assigned as Cin, and all proteins located below the lower cutoff line were assigned as Cout. The a-value was reduced from its starting value a = 1.5 until a pair of C-terminally aligned homologues (as defined above) was found where one was assigned Cout and the other Cin; this happened for a = 0.2 (the YdgQ-YdgL pair (6) and proteins in the SMR family were excluded), and the final cutoff value was set to a = 0.3. For proteins where only one fusion (PhoA or GFP) was available, the cutoff value was set to a = 0.75, as < 1% of the proteins for which both PhoA and GFP fusions were available would have been mis-assigned with this cutoff had only one of the two fusions been available (Fig. 1).

Topology prediction

Unconstrained topology predictions were done using TMHMM (3) (v. 2.0).
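The normalization and the cutoff scan described under Protein expression and experimental determination of C-terminal locations can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used for the study: the function names and the dictionary layout are hypothetical, normalized PhoA activity is assumed to lie on the x-axis and normalized GFP activity on the y-axis of Fig. 1 (consistent with high GFP activity indicating a cytoplasmic C-terminus), and the step size of the scan is assumed to be 0.1 because the text does not state it.

```python
# Median activities of the active fusions, as quoted in the text.
PHOA_MEDIAN = 347.0
GFP_MEDIAN = 3924.0


def normalize(phoa_activity: float, gfp_raw: float, od600: float) -> tuple[float, float]:
    """Return (normalized PhoA, normalized GFP) for one protein."""
    gfp_per_od = gfp_raw / od600  # raw GFP is first divided by cell density
    return phoa_activity / PHOA_MEDIAN, gfp_per_od / GFP_MEDIAN


def assign(phoa_norm: float, gfp_norm: float, a: float) -> str:
    """Assign a C-terminal location from two 45-degree cutoff lines.

    Assumption: x = PhoA, y = GFP, so the upper line is y = x + a
    (through (0, a)) and the lower line is y = x - a (through (a, 0)).
    """
    if gfp_norm - phoa_norm > a:
        return "Cin"
    if phoa_norm - gfp_norm > a:
        return "Cout"
    return "unassigned"


def first_conflicting_a(activities, homologue_pairs, start=1.5, step=0.1):
    """Lower a from `start` until a homologue pair gets opposite assignments.

    `activities` maps a protein name to (phoa_norm, gfp_norm);
    `homologue_pairs` is a list of (name, name) tuples.
    Returns the first conflicting a, or None if no conflict occurs.
    """
    n_steps = int(round(start / step))
    for k in range(n_steps + 1):
        a = round(start - k * step, 10)  # avoid floating-point drift
        for p, q in homologue_pairs:
            calls = {assign(*activities[p], a), assign(*activities[q], a)}
            if calls == {"Cin", "Cout"}:
                return a
    return None
```

In the procedure described in the text, the first conflict appeared at a = 0.2 and the final cutoff was set slightly above it, at a = 0.3; a protein would then be assigned with `assign(phoa_norm, gfp_norm, 0.3)`.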
Constrained topology predictions were done as described (7) by fixing the C-terminus of the protein to its experimentally determined location before running TMHMM (http://www.sbc.su.se/tmhmmfix/). S3 reliability scores were calculated as described (7).

Functional assignments

Proteins were grouped into one of nine functional categories (biogenesis (B), channel (C), transport/efflux (E), flagellar (F), lipid (L), bioenergetics/metabolism (M), signaling (S), transport/influx (T) or unknown (U)) depending on their known or predicted function. Initially, functional annotations were collected from the Colibri (2) (http://genolist.pasteur.fr/colibri/) and SwissProt (8) (http://www.expasy.org/sprot/) databases. For proteins whose function was still unknown, the literature was then searched, and the proteins were assigned to functional categories based on published information.
References

1. F. R. Blattner et al., Science 277, 1453 (1997).
2. C. Medigue, A. Viari, A. Henaut, A. Danchin, Microbiol Rev 57, 623 (1993).
3. A. Krogh, B. Larsson, G. von Heijne, E. Sonnhammer, J Mol Biol 305, 567 (2001).
4. M. Rapp et al., Prot Sci 13, 937 (2004).
5. D. Drew et al., Proc Natl Acad Sci USA 99, 2690 (2002).
6. A. Sääf, M. Johansson, E. Wallin, G. von Heijne, Proc Natl Acad Sci USA 96, 8540 (1999).
7. K. Melén, A. Krogh, G. von Heijne, J Mol Biol 327, 735 (2003).
8. A. Bairoch, B. Boeckmann, Nucl Acids Res 19, 2247 (1991).
Figure S1. TMHMM topology prediction for YjfL before (top) and after (bottom) the C-terminus has been fixed to its experimentally determined location (www.sbc.su.se/tmhmmfix/). The probability of an inside loop is shown in blue, of an outside loop in pink, and of a transmembrane helix in red. The overall reliability score (S3) is shown (K. Melén, A. Krogh, G. von Heijne, J Mol Biol 327, 735 (2003)).
Figure S2. Normalised PhoA and GFP activity data for dual-topology candidates and pairs of homologous proteins with opposite topologies (connected by lines); cf. Fig. 1 in the main text. The YdgQ-YdgL pair has been described earlier (A. Sääf, M. Johansson, E. Wallin, G. von Heijne, Proc Natl Acad Sci USA 96, 8540 (1999)).
Table S2. Sequence characteristics and their correlation coefficients (R) against overexpression levels (GFP/ml) and ΔOD600.

Sequence characteristic                                               GFP/ml   ΔOD600
Sequence length                                                            0    -0.15
Number of TM helices                                                    0.09    -0.26
Reliability score                                                      -0.06     0.13
Membrane content (residues in membrane/sequence length)                 0.01    -0.18
Length of N-tail (residues before first TM)                             0.05     0.14
Length of N-tail/sequence length                                        0.09     0.21
Average hydrophobicity, GES scale (a)                                   0.08     0.24
Minimum hydrophobicity (19 consecutive most hydrophilic residues)       0.15     0.21
Max hydrophobic region (41 residues)                                    0.12     0.16
Min hydrophobic region (41 residues)                                   -0.03     0.11
Average codon usage (CU) (b)                                           -0.02    -0.11
Min CU over 5 codons                                                    0.01    -0.02
Min CU over 10 codons                                                   0.03     0.06
Average CU of first 40 codons                                          -0.05    -0.11
Average CU of first 20 codons                                          -0.09    -0.07
Positive residues in N-tail                                             0.06     0.13
Negative residues in N-tail                                             0.06     0.16
Length of longest inside loop (c)                                       -0.1    -0.11
Length of longest outside loop                                          -0.2    -0.14
Length of longest inside loop/sequence length                           -0.1    -0.07
Length of longest outside loop/sequence length                         -0.25    -0.11
Average hydrophobicity of N-tail                                        0.08     0.01
Positive residues inside (d)                                            0.14     0.11
Negative residues inside                                                 0.1     0.16
Pos residues inside/length                                              0.25     0.35
Neg residues inside/length                                              0.19     0.31
Positive residues outside                                              -0.18    -0.15
Negative residues outside                                              -0.19    -0.21
Pos residues outside/length                                            -0.21    -0.13
Neg residues outside/length                                            -0.24    -0.21
Pos residues in longest inside loop                                     -0.1    -0.13
Neg residues in longest inside loop                                    -0.06    -0.12
Pos residues in longest outside loop                                   -0.17    -0.09
Neg residues in longest outside loop                                   -0.18    -0.12

(a) Engelman, D.M., Steitz, T.A. & Goldman, A. Annu. Rev. Biophys. Biophys. Chem. 1986 15: 321-353.
(b) Calculated using GenBank release 140 (Nucl Acids Res 2004 32:23-26); E. coli K12 CU table from http://www.kazusa.or.jp/codon/e.html.
(c) The longest loop between two TMs that is predicted to be on the inside.
(d) Number of positive amino acids predicted to be inside.
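For readers who want to reproduce coefficients of the kind listed in Table S2, a minimal Python sketch follows. It assumes that R denotes the Pearson correlation coefficient, which the table does not state explicitly; the function name `pearson_r` and the example values are hypothetical and are not data from this study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products of deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Hypothetical example: one sequence characteristic (number of TM helices)
# against one response (GFP/ml) for four made-up proteins.
n_helices = [2, 4, 6, 12]
gfp_per_ml = [310.0, 280.0, 295.0, 150.0]
r = pearson_r(n_helices, gfp_per_ml)
```

Each row of the table would correspond to one such call, with the sequence characteristic as `xs` and either GFP/ml or ΔOD600 as `ys`.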