Proteins. Central Dogma: DNA → RNA → protein. Amino acid polymers - defined composition & order. Perform nearly all cellular functions. Drug targets.
Proteins - cont. Fold into discrete shapes. Specific shapes → specific functions. > How do we determine the shape of a protein? > How does shape define function and influence drug action?
Revolutions in X-ray crystallography. Getting faster - Hemoglobin: 30 years! Your Favorite Protein (YFP): hours to weeks. Not size limited - ribosome, viruses (> 2.5 MDa). Atomic resolution detail.
The PDB surpassed 60,000 structures in 2009 (yearly total) ...and is still growing! http://www.rcsb.org/pdb/statistics/contentgrowthchart.do?content=total&seqid=100
Genome sequencing discoveries - genes per genome:
  Human           >35,000
  Fly             ~13,600
  Flat worm       >19,000
  Plant           25,498
  Yeast           6,400
  >50 microbes    500-5,000
  Viruses         <10-100
  Genomic data is growing even faster!
X-ray crystallography can produce molecular images. [Figure: workflow from (source) protein to crystal, diffraction data, electron density, model building, and structure]
Crystal growth - general requirements
  Need: Pure protein (>98%); chemically, conformationally homogeneous sample
  Add: Precipitating agents (mild organics such as PEG, or salts); buffers, inorganic or organic salts; cofactors, ligands, chemical additives
  Perturb: Hydration state, temperature, solubility response
  Get: Random aggregate - amorphous precipitate (common); ordered phase transition - crystals (hard, rare)
Crystallization methods. Direct mixing of protein and precipitant: wait (microbatch), or dehydrate and wait (vapor diffusion). Free interface diffusion between protein and precipitant. Movie courtesy Fluidigm, Inc.
Crystals 101. Crystals are: ordered arrays of ~$10^{14}$ molecules; ~25-80% water - similar to cells. Native protein structure/activity retained in crystalline state.
Data collection - experimental setup. Source: synchrotron or rotating anode. Beam optics (mirrors). Crystal (cryo-preserved) in N2 stream (100 K). Diffracted X-rays. Detector: CCD, image plate, or film.
Rotate the crystal to record all data - oscillation 1º/frame. Each reflection (spot) arises from a set of Bragg planes. [Figure: diffraction image of a hexagonal lattice, with beam stop shadow and water ring labeled]
X-rays & crystals 101
  Why use X-rays? X-rays are periodic waves (~1 Å wavelength). Electrons (from protein atoms) scatter X-rays. Scattering measurably perturbs incident X-ray properties.
  Why use crystals? Crystals are periodic arrays of proteins. They act as a micro-diffraction grating to constructively amplify the scattering signal. Scattered X-rays carry information about the electron density distribution in a crystal.
Crystal (Bravais) lattice types. 14 lattice geometries can pack into repeating, 3D arrays.
How do molecules pack? Unit cell (U.C.) - fundamental crystal repeat. Asymmetric unit (A.U.) - minimal element within unit cell acted upon by symmetry operators.
  Squiggles per A.U.: 1, 1, 1, 1
  Squiggles per U.C.: 1, 2, 4, 6   (one column per example packing in the figure)
Symmetry and lattice type define Space Groups. 230 groups; 65 accessible to biomolecules (no mirror planes!). Can also have symmetry within an AU! 3 orthogonal 2-fold axes: 222 point group - four AUs. http://neon.mems.cmu.edu/degraef/pointgroups/
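The 222 point group named on this slide can be sketched as four operators acting on one point. This is only an illustration: the names `OPERATORS_222` and `symmetry_mates` are hypothetical, and the axes are assumed to lie along x, y, and z.

```python
# Illustrative sketch: point group 222 as the identity plus three orthogonal
# 2-fold rotations, each written as a sign flip on (x, y, z).
OPERATORS_222 = [
    (1, 1, 1),     # identity
    (1, -1, -1),   # 2-fold rotation about x
    (-1, 1, -1),   # 2-fold rotation about y
    (-1, -1, 1),   # 2-fold rotation about z
]

def symmetry_mates(point):
    """Return the copies of `point` generated by the 222 operators."""
    x, y, z = point
    return [(sx * x, sy * y, sz * z) for sx, sy, sz in OPERATORS_222]

mates = symmetry_mates((1.0, 2.0, 3.0))
print(len(mates))  # one copy per asymmetric unit
```

Every operator has a sign product of +1, so each is a pure rotation. A mirror would have a product of -1 and would turn L-amino acids into their mirror images, which is why mirror planes are excluded for biomolecules.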
X-rays are electromagnetic waves. A simple wave can be described by $f(x) = F\cos 2\pi(hx + \alpha)$, where $F$ = amplitude, $\lambda$ = wavelength, $\alpha$ = phase (the figure shows $\alpha = 0$), and $h = 1/\lambda$.
Fourier syntheses reconstruct electron density from diffraction data. $\rho(x) = \sum_h F(h)\cos 2\pi(hx - \alpha(h))$ = sum of cosine terms.
  Terms for the target function:
    h    F(h)    α(h)/360°
    1    1       0
    3    -1/3    0.5
    5    1/5     0
  Sum - approximates target function. More terms → better approximation. Concept of RESOLUTION.
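A minimal sketch of a one-dimensional Fourier synthesis, using the cosine-sum form on this slide. The target chosen here is a square wave, picked only for illustration and not necessarily the slide's target function. Amplitudes are taken as positive, with the sign of each term carried by its phase (given as a fraction of a cycle). The function names are hypothetical.

```python
import math

def square_wave_terms(n_terms):
    """Return (h, F, alpha) terms for the odd harmonics of a square wave.

    Assumption: alpha is a fraction of a full cycle, so alpha = 0.5
    flips the sign of a term.
    """
    terms = []
    for k in range(n_terms):
        h = 2 * k + 1
        alpha = 0.5 if k % 2 else 0.0
        terms.append((h, 1.0 / h, alpha))
    return terms

def fourier_synthesis(x, terms):
    """rho(x) = sum over h of F(h) * cos(2*pi*(h*x - alpha(h)))."""
    return sum(F * math.cos(2 * math.pi * (h * x - alpha)) for h, F, alpha in terms)

# With this normalization the square-wave plateau sits at pi/4.
for n in (3, 20, 200):
    print(n, abs(fourier_synthesis(0.0, square_wave_terms(n)) - math.pi / 4))
```

The printed error shrinks as terms are added, which is the one-dimensional analogue of improving resolution.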
Onward to structure: diffraction data (example spots: hkl = 18,17,0 and hkl = 17,12,0)
  1) Collect data, index reflections (spots) - hkl terms ("addresses")
  2) Integrate spot intensities; calculate amplitudes ($F \propto \sqrt{I}$)
  3) Calculate scattered wave phases:
     Experimentally (deliberately modulate spot intensities): heavy atom substitution (multiple isomorphous replacement, MIR); multiwavelength anomalous dispersion (MAD)
     Computationally (use prior model): molecular replacement (MR)
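Step 2 can be sketched as a square root with first-order error propagation. This is a simplification under stated assumptions: real data reduction programs treat weak and negative intensities statistically, while this hypothetical helper just sets them to zero.

```python
import math

def amplitude_from_intensity(intensity, sigma_i):
    """Return (F, sigma_F) using F = sqrt(I) and sigma_F = sigma_I / (2F)."""
    if intensity <= 0:
        # Simplification: no amplitude or error estimate for non-positive I.
        return 0.0, None
    f = math.sqrt(intensity)
    return f, sigma_i / (2.0 * f)

print(amplitude_from_intensity(400.0, 40.0))
```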
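Waves can be represented as vectors: length $F$ (amplitude) and angle $\alpha$ (phase). [Figure: a wave with $\alpha = 0$ and $1/\lambda = h$, drawn next to its vector of length $F$ at angle $\alpha$]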
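Atomic scattering is additive: $F_{PH} = F_P + F_H$, so $F_{PH} - F_H = F_P$ (vector sums; P = protein, H = heavy atom, PH = heavy atom derivative). Identify heavy atom positions (HA xyz), can calculate $F_H$. But only amplitudes are measured, so there are two possibilities for $F_{PH}$ in solving for $F_P$!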
Get a second derivative! $F_{PH} = F_P + F_H$, so $F_{PH} - F_H = F_P$; $F_{PH2} = F_P + F_{H2}$, so $F_{PH2} - F_{H2} = F_P$. With HA1 xyz, calculate $F_{H1}$ (the $F_H$ of the first derivative). With HA2 xyz, calculate $F_{H2}$. Leaves only one possibility for $F_P$!
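The phase ambiguity and its resolution can be sketched numerically. The numbers below are made up, and the function names are hypothetical. The sketch assumes error-free amplitudes and known heavy-atom vectors, and uses the law of cosines on the triangle formed by the protein, heavy-atom, and derivative vectors.

```python
import cmath
import math

def candidate_phases(fp_amp, fph_amp, fh):
    """Return the two protein phases (radians) consistent with one derivative."""
    fh_amp, fh_phase = abs(fh), cmath.phase(fh)
    cos_delta = (fph_amp**2 - fp_amp**2 - fh_amp**2) / (2 * fp_amp * fh_amp)
    cos_delta = max(-1.0, min(1.0, cos_delta))  # guard against rounding
    delta = math.acos(cos_delta)
    return (fh_phase + delta, fh_phase - delta)

def angular_distance(a, b):
    """Smallest absolute difference between two angles (radians)."""
    return abs(cmath.phase(cmath.exp(1j * (a - b))))

def resolve_phase(cands1, cands2):
    """Pick the candidate from cands1 that best agrees with any in cands2."""
    return min((angular_distance(a, b), a) for a in cands1 for b in cands2)[1]

# Made-up example: protein vector of amplitude 10 at 60 degrees.
true_fp = cmath.rect(10.0, math.radians(60.0))
fh1 = cmath.rect(3.0, math.radians(10.0))    # heavy-atom vector, derivative 1
fh2 = cmath.rect(4.0, math.radians(200.0))   # heavy-atom vector, derivative 2
cands1 = candidate_phases(abs(true_fp), abs(true_fp + fh1), fh1)
cands2 = candidate_phases(abs(true_fp), abs(true_fp + fh2), fh2)
resolved = resolve_phase(cands1, cands2)
print(round(math.degrees(resolved), 3))
```

Each derivative alone leaves two candidates. Only the true phase is shared by both sets, which is why the second derivative removes the ambiguity.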
Multiwavelength Anomalous Dispersion -- MAD
  1. Derivatize YFP with heavy metal(s) (commonly SeMet)
  2. Change wavelength to change X-ray absorption by metals (anomalous dispersion); synchrotron needed
  3. When X-rays are absorbed, $F(hkl) \neq F(-h\,{-k}\,{-l})$
  4. Use anomalous differences, $F(hkl) - F(-h\,{-k}\,{-l})$, to locate metals
  5. Calculate amplitude and phase of scattering from metals
  6. Calculate probability of $\alpha_P(hkl)$
  7. Each wavelength limits protein phases to 2 most probable values
  8. Resolve phase ambiguity with: multiple wavelengths (MAD), solvent flattening (SAD), noncrystallographic symmetry averaging (model)...
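Step 3 can be illustrated with a toy one-dimensional structure. This is a sketch, not real data: the scattering factors and positions are invented. Absorption is modeled as an imaginary component on one atom's scattering factor, and the function names are hypothetical.

```python
import cmath
import math

def structure_factor(h, atoms):
    """F(h) = sum_j f_j * exp(2*pi*i*h*x_j) for 1D fractional coordinates."""
    return sum(f * cmath.exp(2j * math.pi * h * x) for f, x in atoms)

def bijvoet_difference(h, atoms):
    """|F(h)| - |F(-h)|; zero when every scattering factor is real."""
    return abs(structure_factor(h, atoms)) - abs(structure_factor(-h, atoms))

# (scattering factor, fractional position); all values are illustrative.
light = [(6.0, 0.10), (7.0, 0.35), (8.0, 0.62)]
heavy = light + [(30.0 + 4.0j, 0.80)]  # imaginary part models absorption

print(bijvoet_difference(2, light))  # no absorber
print(bijvoet_difference(2, heavy))  # with an absorbing atom
```

Without an absorber the two amplitudes agree to within rounding. With one, they differ, and that difference is the signal used to locate the metals.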
Onward to structure: diffraction data → (FT) → electron density. 4) Apply Fourier synthesis to reconstruct electron density: structure factor equation.
Structure factor equation
  $\rho(xyz) = \frac{1}{V}\sum_h \sum_k \sum_l F(hkl)\cos 2\pi(hx + ky + lz - \alpha(hkl))$
  $\rho$ = electron density
  x, y, z = positions in crystalline repeat (fractional coordinates)
  V = unit cell volume
  F(hkl) = amplitude for reflection hkl
  h, k, l = integers, coordinates of each spot
  hx + ky + lz = counter through the unit cell
  $\alpha(hkl)$ = phase angle (°), $\alpha$, of spot hkl divided by 360
  Sums run over the indices h, k, l.
  [Figure: unit cell axes x, y, z with example values ρ=0 @ 2,1,5 and ρ=6.7 @ 4,4,1]
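The summation can be written out directly. This is a toy sketch: the reflection list and volume are hypothetical, and phases are stored as fractions of a cycle, following the slide's "phase angle divided by 360" convention.

```python
import math

def electron_density(x, y, z, reflections, volume):
    """rho(xyz) = (1/V) * sum of F(hkl) * cos(2*pi*(hx + ky + lz - alpha)).

    reflections maps (h, k, l) to (amplitude, alpha); x, y, z are
    fractional coordinates within the unit cell.
    """
    total = 0.0
    for (h, k, l), (amp, alpha) in reflections.items():
        total += amp * math.cos(2 * math.pi * (h * x + k * y + l * z - alpha))
    return total / volume

# Hypothetical single-reflection data set, for illustration only.
toy = {(1, 0, 0): (2.0, 0.25)}
print(electron_density(0.25, 0.0, 0.0, toy, volume=2.0))
```

A real map is computed on a grid of x, y, z points with a fast Fourier transform. The explicit loop here only shows what each term contributes.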
Resolution affects electron density interpretability. Higher scattering angles add more spots (Fourier terms). Higher resolution → more information content. [Figure: maps at 6 Å resolution and at 3 Å resolution] Side chains are evident at resolutions better than ~3.5 Å.
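The link between scattering angle and resolution follows from Bragg's law, λ = 2d sin θ. The slides mention Bragg planes but do not write the law out, so it is brought in here as standard background. The function name is hypothetical.

```python
import math

def resolution_from_angle(wavelength, two_theta_deg):
    """Return the Bragg spacing d (same units as wavelength) at angle 2-theta."""
    theta = math.radians(two_theta_deg / 2.0)
    return wavelength / (2.0 * math.sin(theta))

# Spots farther from the direct beam correspond to finer detail (smaller d).
for two_theta in (10.0, 20.0, 40.0):
    print(two_theta, round(resolution_from_angle(1.0, two_theta), 2))
```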
From maps to model: electron density → interpretation/model building
  6) Thread amino acid sequence through electron density (manually or automatically)
  7) Use amino acid shape and sequence as a guide (3D jigsaw)
  8) Refine model computationally to find best match to data ($F_{calc}$ vs. $F_{obs}$) and optimize stereochemistry
Refinement. Model → $F_{calc}$, $\alpha_{calc}$; data → $F_{obs}$. Compare ($F_{obs} - F_{calc}$; $\alpha_{calc}$) and apply covalent geometry (molecular dynamics) → shifts in Δx,y,z and B → manual rebuilding. Iterate until convergence. B ("temperature") factor = disorder relative to a point atom.
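To give the B factor a physical scale, the standard isotropic relation B = 8π²⟨u²⟩ converts it to a root-mean-square displacement. The relation is not stated on the slide and is added here as background. The function name and the sample B values are illustrative.

```python
import math

def rms_displacement(b_factor):
    """Return sqrt(<u^2>) in Å for an isotropic B factor in Å^2."""
    return math.sqrt(b_factor / (8.0 * math.pi ** 2))

for b in (10.0, 30.0, 80.0):
    print(b, round(rms_displacement(b), 2))
```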
R and Rfree values -- the gold standards
  $R = \dfrac{\sum \big|\,|F_{obs}| - |F_{calc}|\,\big|}{\sum |F_{obs}|}$
    = 0.59 for random model
    = 0.4-0.55 for starting model
    < 0.25, good fit, errors still possible
    < 0.20, excellent fit
  $R_{free}$ = R value of a small, random subset never used in refinement. Ideally, $R_{free} < 0.30$ and $R_{free} \le R_{work} + 0.05$ (this scales with resolution, however).
  Model is complete when: no interpretable difference electron density ($F_{obs} - F_{calc}$); geometry close to ideal; no clashes, optimal rotamer stereochemistry.
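The R value and the work/free split can be sketched as follows. The amplitudes are invented, the 5% free fraction is used as a default, and the function names are hypothetical. A real refinement program also scales the calculated amplitudes to the observed ones, which is omitted here.

```python
import random

def r_factor(f_obs, f_calc):
    """R = sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|)."""
    num = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    den = sum(abs(o) for o in f_obs)
    return num / den

def split_work_free(indices, free_fraction=0.05, seed=0):
    """Randomly set aside a free set that refinement never sees."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_free = max(1, round(len(shuffled) * free_fraction))
    return shuffled[n_free:], shuffled[:n_free]

work, free = split_work_free(range(100))
print(len(work), len(free))
print(r_factor([10.0, 20.0, 30.0], [9.0, 22.0, 30.0]))
```

Rwork is `r_factor` over the work reflections and Rfree is the same function over the free reflections. A widening gap between the two suggests overfitting.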
Interpreting the data - the structure table: data collection
  Data set                      Remote              Peak                HgCl2 derivative
  Space group                   P4₃2₁2              P4₃2₁2              P4₃2₁2
  Unit cell a, b, c (Å)         52.8, 52.8, 160.1   52.8, 52.8, 160.1   52.1, 52.1, 162.9
  Unit cell α, β, γ (°)         90, 90, 90          90, 90, 90          90, 90, 90
  Wavelength (Å)                1.1000              0.9791              1.0093
  Resolution range (Å)          45-2.4              45-2.4              43-3.1
  Total reflections             206,536             72,214              41,453
  Unique reflections            10,986              10,171              7,537
  Redundancy                    18.8 (15.5) a       7.1 (6.8)           5.5 (3.8)
  Completeness (%)              99.9 (100)          99.5 (100)          98.6 (99.2)
  I/σ (signal to noise)         43.3 (7.3)          41.7 (9.6)          34.9 (4.2)
  Rsym (%) b (data agreement)   5.5 (30.8)          5.0 (20.0)          5.0 (37.4)
  a Values in parentheses are for highest-resolution shells.
  b $R_{sym} = \sum_h \sum_j |I(h)_j - \langle I(h) \rangle| \,/\, \sum_h \sum_j I(h)_j$, where $I(h)_j$ is the scaled observed intensity of the jth observation of reflection h, and $\langle I(h) \rangle$ is the mean value of corresponding symmetry-related reflections.
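Two rows of this table can be recomputed from first principles. Redundancy is total reflections divided by unique reflections, checked below against the table's counts. Rsym follows the footnote formula. The intensity lists are invented and the function names are hypothetical.

```python
def redundancy(total_reflections, unique_reflections):
    """Average number of times each unique reflection was measured."""
    return total_reflections / unique_reflections

def r_sym(observations):
    """Rsym over a dict mapping each reflection to its scaled intensities."""
    num = 0.0
    den = 0.0
    for intensities in observations.values():
        mean = sum(intensities) / len(intensities)
        num += sum(abs(i - mean) for i in intensities)
        den += sum(intensities)
    return num / den

# Counts taken from the table: Remote, Peak, HgCl2 derivative.
for total, unique in ((206536, 10986), (72214, 10171), (41453, 7537)):
    print(round(redundancy(total, unique), 1))

# Made-up repeated measurements of two reflections.
print(r_sym({(1, 0, 0): [90.0, 110.0], (0, 1, 0): [50.0, 50.0]}))
```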
Interpreting the data - the structure table: refinement parameters
  Resolution (Å)                    45-2.3
  No. of nonhydrogen atoms          29,845
  No. of waters                     243
  No. of ions                       3
  B factors
    Overall                         30.1
    Protein                         29.2
    Ligand/ion                      39.5
    Water                           34.6
  Rwork/Rfree e                     19.4/22.9   (model/data agreement)
  Rmsd
    Bond lengths (Å)                0.013       (should be <0.02 Å)
    Bond angles (°)                 1.4         (should be <2.0°)
  Ramachandran (stereochemistry)
    Favored                         90.4%
    Allowed                         7.6%
    Generous                        2%
    Disallowed                      0
  e $R_{work} = \sum \big|\,|F_{obs}| - |F_{calc}|\,\big| \,/\, \sum |F_{obs}|$, where $F_{obs}$ and $F_{calc}$ are observed and model structure factors, respectively. $R_{free}$ was calculated by using a randomly selected set (5%) of reflections.
Making sense of the PDB file - header info
  HEADER ISOMERASE 02-JUN-05 1ZVU
  TITLE STRUCTURE OF THE FULL-LENGTH E. COLI PARC SUBUNIT
  COMPND 2 MOLECULE: TOPOISOMERASE IV SUBUNIT A;
  SOURCE 2 ORGANISM_SCIENTIFIC: ESCHERICHIA COLI;
  KEYWDS BETA-PINWHEEL, ATPASE, SUPERCOILING, DECATENATION, DNA
  EXPDTA X-RAY DIFFRACTION
  AUTHOR K.D.CORBETT,A.J.SCHOEFFLER,N.D.THOMSEN,J.M.BERGER
  JRNL TITL THE STRUCTURAL BASIS FOR SUBSTRATE SPECIFICITY IN
  JRNL TITL 2 DNA TOPOISOMERASE IV.
  JRNL REF J.MOL.BIOL. V. 351 545 2005
  REMARK 1
  REMARK 2
  REMARK 2 RESOLUTION. 3.00 ANGSTROMS.
  REMARK 3
  REMARK 3 REFINEMENT.
  REMARK 3 PROGRAM : REFMAC 5.2.0005
  REMARK 3 AUTHORS : MURSHUDOV,VAGIN,DODSON
  REMARK 3
  REMARK 3 REFINEMENT TARGET : MAXIMUM LIKELIHOOD
  REMARK 3
  REMARK 3 DATA USED IN REFINEMENT.
  REMARK 3 RESOLUTION RANGE HIGH (ANGSTROMS) : 3.00
  REMARK 3 RESOLUTION RANGE LOW (ANGSTROMS) : 20.00
  REMARK 3 DATA CUTOFF (SIGMA(F)) : 1.000
  REMARK 3 COMPLETENESS FOR RANGE (%) : 89.6
  REMARK 3 NUMBER OF REFLECTIONS : 18167
Making sense of the PDB file - the guts
  SEQRES 1 A 716 MET ASP ARG ALA LEU PRO PHE ILE GLY ASP GLY LEU LYS
  SEQRES 2 A 716 PRO VAL GLN ARG ARG ILE VAL TYR ALA MET SER GLU LEU
  SEQRES 3 A 716 GLY LEU ASN ALA SER ALA LYS PHE LYS LYS SER ALA ARG
  HELIX 1 1 LYS A 39 SER A 50 1 12
  HELIX 2 2 THR A 66 GLY A 72 1 7
  HELIX 3 3 ASP A 79 ALA A 90 1 12
  SHEET 1 A 2 VAL A 100 GLY A 102 0
  SHEET 2 A 2 SER A 123 LEU A 125 -1 O ARG A 124 N ASP A 101
  CRYST1 257.990 62.141 63.998 90.00 90.00 90.00 P 21 21 2 4
  ORIGX1 1.000000 0.000000 0.000000 0.00000
  ORIGX2 0.000000 1.000000 0.000000 0.00000
  ORIGX3 0.000000 0.000000 1.000000 0.00000
  SCALE1 0.003876 0.000000 0.000000 0.00000
  SCALE2 0.000000 0.016092 0.000000 0.00000
  SCALE3 0.000000 0.000000 0.015625 0.00000
  ATOM 1 N   ASP A 28  -7.840 5.599 -4.925 1.00 35.20 N
  ATOM 2 CA  ASP A 28  -7.889 6.594 -3.807 1.00 35.29 C
  ATOM 3 C   ASP A 28  -8.269 8.003 -4.275 1.00 34.79 C
  ATOM 4 O   ASP A 28  -9.238 8.576 -3.783 1.00 34.99 O
  ATOM 5 CB  ASP A 28  -6.550 6.628 -3.059 1.00 35.74 C
  ATOM 6 CG  ASP A 28  -6.250 7.999 -2.449 1.00 37.37 C
  ATOM 7 OD1 ASP A 28  -6.857 8.351 -1.406 1.00 38.79 O
  ATOM 8 OD2 ASP A 28  -5.402 8.722 -3.024 1.00 38.60 O
  ATOM 9 N   ARG A 29  -7.495 8.558 -5.207 1.00 34.04 N
  TER
  END
  ATOM record fields, left to right: atom number, atom identifier, amino acid type, protein chain ID, residue number, atomic position (x, y, z), occupancy, B-factor, element. CRYST1 holds the cell dimensions and space group.
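ATOM records are fixed-width, so adjacent fields can touch (a residue number followed directly by a negative coordinate, for example). That is why splitting on whitespace is unreliable. The sketch below slices by the standard PDB column positions. It assumes the file keeps the original fixed-column layout, which a slide listing may not. The helper names are hypothetical, and the sample line is rebuilt with a format string so that its columns are exact.

```python
def parse_atom_line(line):
    """Parse one fixed-width ATOM/HETATM record by column position."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }

def to_fractional(atom, cell_edges):
    """Cartesian Å to fractional coordinates; orthorhombic cells only."""
    a, b, c = cell_edges
    return (atom["x"] / a, atom["y"] / b, atom["z"] / c)

# First ATOM record of the listing, rebuilt with exact column widths.
SAMPLE = ("ATOM  {:5d} {:<4s} {:>3s} {:1s}{:4d}    "
          "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}").format(
    1, " N", "ASP", "A", 28, -7.840, 5.599, -4.925, 1.00, 35.20)

atom = parse_atom_line(SAMPLE)
print(atom)
# Cell edges from the CRYST1 record; all angles are 90 degrees.
print(to_fractional(atom, (257.990, 62.141, 63.998)))
```

For this orthorhombic cell the SCALE matrix is diagonal, with entries equal to 1/a, 1/b, and 1/c, so dividing by the cell edges reproduces what SCALE1-3 encode.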
Some things to keep in mind
  PDB file oddities:
    Occ<1 - partial occupancy, sometimes seen for ligands
    B>>B avg - disordered region, interpret with caution
    Missing side chain or sequence gap - region not modeled, likely disordered
    Two copies of same amino acid - multiple conformations modeled
    Waters/ligands often at end of file
  The model is still a model:
    Best fit to data; doesn't mean everything is perfect or right
    Higher resolution models typically more accurate - use for homology modeling, molecular replacement, analysis of active site geometry, etc.
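The first two oddities can be screened for automatically. A minimal sketch, assuming atoms are already available as dictionaries with `occupancy` and `b_factor` keys. The cutoff of twice the average B is an arbitrary choice for illustration, not a community standard, and the function name is hypothetical.

```python
def flag_atoms(atoms, b_ratio=2.0):
    """Return (index, reasons) for atoms with Occ < 1 or B well above average."""
    b_avg = sum(a["b_factor"] for a in atoms) / len(atoms)
    flags = []
    for i, a in enumerate(atoms):
        reasons = []
        if a["occupancy"] < 1.0:
            reasons.append("partial occupancy")
        if a["b_factor"] > b_ratio * b_avg:
            reasons.append("high B factor")
        if reasons:
            flags.append((i, reasons))
    return flags

# Made-up atoms for illustration.
example = [
    {"occupancy": 1.0, "b_factor": 20.0},
    {"occupancy": 0.5, "b_factor": 20.0},
    {"occupancy": 1.0, "b_factor": 100.0},
    {"occupancy": 1.0, "b_factor": 20.0},
]
print(flag_atoms(example))
```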
Representations of protein structure. Ribbon representation: traces path of protein chain through space. Surface representation: shows solid features of protein exterior. Spheres and sticks: show atomic connections. Remember - a model is still a model!
Where is crystallography headed?
  Dissect mechanism and catalysis: structure/function studies; time resolved reactions
  Harder problems: dynamic, metastable complexes and assemblies; membrane proteins
  Rational ligand/inhibitor design
  Define cellular proteome
Where do we need physics?
  Detectors: increase sensitivity, dynamic range, speed
  Sources: benchtop synchrotrons; overcome radiation damage
  Crystallization: develop rational guidelines & novel approaches; use of non-diffracting/poorly-diffracting crystals
  Functional prediction: extracting/simulating dynamics from models and data; docking/modeling interactions
  Single protein diffraction: free electron lasers; data analyses