Data quality indicators. Kay Diederichs

Similar documents
Linking data and model quality in macromolecular crystallography. Kay Diederichs

Data quality noise, errors, mistakes

Resolution and data formats. Andrea Thorn

X-ray Crystallography I. James Fraser Macromolecluar Interactions BP204

Twinning. Andrea Thorn

SHELXC/D/E. Andrea Thorn

CCP4 Diamond 2014 SHELXC/D/E. Andrea Thorn

Diffraction Geometry

Data Reduction. Space groups, scaling and data quality. MRC Laboratory of Molecular Biology Cambridge UK. Phil Evans Diamond December 2017

Experimental phasing, Pattersons and SHELX Andrea Thorn

RNA protects a nucleoprotein complex against radiation damage

Quantifying instrument errors in macromolecular X-ray data sets

Twinning in PDB and REFMAC. Garib N Murshudov Chemistry Department, University of York, UK

X-ray Data Collection. Bio5325 Spring 2006

Electron Crystallography for Structure Solution Latest Results from the PSI Electron Diffraction Group

X-ray Crystallography. Kalyan Das

8:10 pm Keynote Lecture: Michael G. Rossmann, Purdue University Current challenges in structural biology

X-ray Crystallography

Changing and challenging times for service crystallography. Electronic Supplementary Information

X-Ray structure analysis

X-Ray Damage to Biological Crystalline Samples

PSD '18 -- Xray lecture 4. Laue conditions Fourier Transform The reciprocal lattice data collection

Experimental Phasing of SFX Data. Thomas Barends MPI for Medical Research, Heidelberg. Conclusions

Calculating strategies for data collection

Protein crystallography. Garry Taylor

Ensemble refinement of protein crystal structures in PHENIX. Tom Burnley Piet Gros

Practical aspects of SAD/MAD. Judit É Debreczeni

The Reciprocal Lattice


Summary of Experimental Protein Structure Determination. Key Elements

Anisotropy in macromolecular crystal structures. Andrea Thorn July 19 th, 2012

Protein Crystallography

Basic Crystallography Part 1. Theory and Practice of X-ray Crystal Structure Determination

High-resolution crystal structure of ERAP1 with bound phosphinic transition-state analogue inhibitor

Why do We Trust X-ray Crystallography?

Biological Small Angle X-ray Scattering (SAXS) Dec 2, 2013

Scattering Lecture. February 24, 2014

Acta Crystallographica Section F

GC376 (compound 28). Compound 23 (GC373) (0.50 g, 1.24 mmol), sodium bisulfite (0.119 g,

IgE binds asymmetrically to its B cell receptor CD23

SOLID STATE 9. Determination of Crystal Structures

Protein Structure Determination. Part 1 -- X-ray Crystallography

Resolution: maximum limit of diffraction (asymmetric)

Characterization of low energy ionization signals from Compton scattering in a CCD Dark Matter detector

Biology III: Crystallographic phases

SUPPLEMENTARY INFORMATION

Electronic Supplementary Information (ESI) for Chem. Commun. Unveiling the three- dimensional structure of the green pigment of nitrite- cured meat

Table S1. Overview of used PDZK1 constructs and their binding affinities to peptides. Related to figure 1.

Non-merohedral Twinning in Protein Crystallography

Detection of X-Rays. Solid state detectors Proportional counters Microcalorimeters Detector characteristics

NIH Public Access Author Manuscript J Phys Conf Ser. Author manuscript; available in PMC 2015 June 01.

Web-based Auto-Rickshaw for validation of the X-ray experiment at the synchrotron beamline

Analytical Chemistry II

Pseudo translation and Twinning

research papers Detecting outliers in non-redundant diffraction data 1. Introduction Randy J. Read

C. Watson, E. Churchwell, R. Indebetouw, M. Meade, B. Babler, B. Whitney

SI Text S1 Solution Scattering Data Collection and Analysis. SI references

Crystal planes. Neutrons: magnetic moment - interacts with magnetic materials or nuclei of non-magnetic materials. (in Å)

Semiconductor Optoelectronics Prof. M. R. Shenoy Department of Physics Indian Institute of Technology, Delhi

Synthetic, Structural, and Mechanistic Aspects of an Amine Activation Process Mediated at a Zwitterionic Pd(II) Center

Error Reporting Recommendations: A Report of the Standards and Criteria Committee

General theory of diffraction

print first name print last name print student id grade

Detectors in Nuclear Physics: Monte Carlo Methods. Dr. Andrea Mairani. Lectures I-II

Direct Method. Very few protein diffraction data meet the 2nd condition

type GroEL-GroES complex. Crystals were grown in buffer D (100 mm HEPES, ph 7.5,

SOLID STATE 18. Reciprocal Space

I. INTRODUCTION AND HISTORICAL PERSPECTIVE

Data processing and reduction

Data Collection. Overview. Methods. Counter Methods. Crystal Quality with -Scans

Quantitative determination of the effect of the harmonic component in. monochromatised synchrotron X-ray beam experiments

Proteins. Central Dogma : DNA RNA protein Amino acid polymers - defined composition & order. Perform nearly all cellular functions Drug Targets

Experimental Phasing with SHELX C/D/E

ID14-EH3. Adam Round

AP5301/ Name the major parts of an optical microscope and state their functions.

X-ray Diffraction. Diffraction. X-ray Generation. X-ray Generation. X-ray Generation. X-ray Spectrum from Tube

Macromolecular X-ray Crystallography

Semiconductor Optoelectronics Prof. M. R. Shenoy Department of physics Indian Institute of Technology, Delhi

Scattering by two Electrons

F. Elohim Becerra Chavez

From x-ray crystallography to electron microscopy and back -- how best to exploit the continuum of structure-determination methods now available

Crystals, X-rays and Proteins

Determination of the Substructure

X-rays. X-ray Radiography - absorption is a function of Z and density. X-ray crystallography. X-ray spectrometry

Structure solution from weak anomalous data

Recognition of the amber UAG stop codon by release factor RF1

Setting The motor that rotates the sample about an axis normal to the diffraction plane is called (or ).

Schematic representation of relation between disorder and scattering

Sigma Bond Metathesis with Pentamethylcyclopentadienyl Ligands in Sterically. Thomas J. Mueller, Joseph W. Ziller, and William J.

Copyright Mark Brandt, Ph.D A third method, cryogenic electron microscopy has seen increasing use over the past few years.

Swanning about in Reciprocal Space. Kenneth, what is the wavevector?

Fast, Intuitive Structure Determination IV: Space Group Determination and Structure Solution

X-Ray Emission and Absorption

Small-Angle X-ray Scattering (SAXS) SPring-8/JASRI Naoto Yagi

Semiconductor Physics and Devices

Supporting Information

Structure of Materials Prof. Anandh Subramaniam Department of Material Science and Engineering Indian Institute of Technology, Kanpur

A Primer in X-ray Crystallography for Redox Biologists. Mark Wilson Karolinska Institute June 3 rd, 2014

arxiv:quant-ph/ v2 7 Nov 2001

Transcription:

Data quality indicators Kay Diederichs

Crystallography has been highly successful Now 105839 Could it be any better? 2

Confusion what do these mean? CC1/2 Rmerge Rsym Mn(I/sd) I/σ Rmeas CCanom Rpim Rcum 3

Topics Signal versus noise Random versus systematic error Accuracy versus precision Unmerged versus merged data R-values versus correlation coefficients Choice of high-resolution cutoff 4

signal vs noise easy hard threshold of solvability impossible James Holton slide5

noise : what is noise? what kinds of errors exist? noise = random error + systematic error random error results from quantum effects systematic error results from everything else: technical or other macroscopic aspects of the experiment 6

Random error (noise) Statistical events: photon emission from xtal photon absorption in detector electron hopping in semiconductors (amplifier etc) 7

non-obvious Systematic errors (noise) beam flicker (instability) in flux or direction shutter jitter vibration due to cryo stream split reflections, secondary lattice(s) absorption from crystal and loop radiation damage detector calibration and inhomogeneity; overload shadows on detector deadtime in shutterless mode imperfect assumptions about the experiment and its geometric parameters in the processing software... 8

non-obvious Adding noise 2 2 2 2 2 2 1 ++ 11 = 1.4 3 + 1 = 3.2 2 2 10 + 1 = 10.05 2 2 = 2 σ1 + σ2 σtotal 2 James Holton slide 14

This law is only valid if errors are independent! 15

non-obvious How do random and systematic error depend on the signal? random error obeys Poisson statistics error = square root of signal Systematic error is proportional to signal error = x * signal (e.g. x=0.02... 0.10 ) (which is why James Holton calls it fractional error ; there are exceptions) 16

non-obvious Consequences need to add both types of errors at high resolution, random error dominates at low resolution, systematic error dominates but: radiation damage influences both the low and the high resolution (the factor x is low at low resolution, and high at high resolution) 17

non-obvious How to measure quality? B. Rupp, Biomolecular Crystallography Accuracy Precision how close to the true value? how close are measurements? 18

non-obvious What is the true value? if only random error exists, accuracy = precision (on average) if unknown systematic error exists, true value cannot be found from the data themselves a good model can provide an approximation to the truth model calculations do provide the truth consequence: precision can easily be calculated, but not accuracy accuracy and precision differ by the unknown systematic error All data quality indicators estimate precision (only), but YOU want to know accuracy! 19

Numerical example Repeatedly determine π=3.14159... as 2.718, 2.716, 2.720 : high precision, low accuracy. Precision= relative deviation from average value= (0.002+0+0.002)/(2.718+2.716+2.720) = 0.049% Accuracy= relative deviation from true value= (3.14159-2.718) / 3.14159 = 13.5% Repeatedly determine π=3.14159... as 3.1, 3.2, 3.0 : low precision, high accuracy Precision= relative deviation from average value= (0.04159+0+0.05841+0.14159)/(3.1+3.2+3.0) = 2.6% Accuracy= relative deviation from true value: 3.14159-3.1 = 1.3% 20

Calculating the precision of unmerged data Precision indicators for the unmerged (individual) observations: <Ii/σi > (σi from error propagation) n I i ( hkl ) I ( hkl ) R merge= hkl i=1 n I i (hkl ) hkl i=1 R meas = hkl n n I ( hkl ) I ( hkl ) n 1 i= 1 i n I i (hkl ) Rmeas ~ 0.8 / <Ii/σi > hkl i= 1 21

Averaging ( merging ) of observations Intensities: I = Ii/σi2 / 1/σi2 Sigmas: σ2 = 1 / 1/σi2 (see Wikipedia: weighted mean ) 22

non-obvious Merging of observations may improve accuracy and precision Averaging ( merging ) requires multiplicity ( redundancy ) (Only) if errors are unrelated, averaging with multiplicity n decreases the error of the averaged data by sqrt(n) Random errors are unrelated by definition: averaging always decreases the random error of merged data Averaging may decrease the systematic error in the merged data. This requires sampling of its possible values - true multiplicity If errors are related, precision improves, but not accuracy 23

Calculating the precision of merged data using the sqrt(n) law: <I/σ(I)> n 1 / n 1 I i ( hkl ) I ( hkl ) R pim = hkl i=1 n I i ( hkl ) Rpim ~ 0.8 / <I/σ > hkl i= 1 by comparing averages of two randomly selected half-datasets X,Y: H,K,L 1,2,3 1,2,4 1,2,5... Ii in order of Assignment to measurement 100 110 120 90 80 100 50 60 45 60 1000 1050 1100 1200 half-dataset X, X, Y, X, Y, Y YXYX XYYX Average I of X Y 100 100 60 47.5 1100 1075 (calculate the R-factor (D&K1997) or correlation coefficient (K&D 2012) on X, Y ) 24

I/σ with σ2 = 1 / 1/σi2 25

Shall I use an indicator for precision of unmerged data, or of merged data? It is essential to understand the difference between the two types, but you don't find this in the papers / textbooks! Indicators for precision of unmerged data help to e.g. * decide between spacegroups * calculate amount of radiation damage (see XDS tutorial) Indicators for precision of merged data assess suitability * for downstream calculations (MR, phasing, refinement) 26

Crystallographic statistics - which indicators are being used? Data R-values: Rpim < Rmerge=Rsym < Rmeas n R pim = 1 /n 1 d Iatia( hkl ) I ( hkl ) hkl edi= 1 merg n I i ( hkl ) hkl i= 1 n I i ( hkl ) a It(ahkl ) ed d g r n e unm I i ( hkl ) Rmerge = hkl i=1 Rmeas = hkl dd g r ne e unm I i ( hkl ) hkl i=1 Model R-values: Rwork/Rfree n n I ( hkl ) I ( hkl ) n 1 i=1 i ata hkl i= 1 (hkl ) F obs ( hkl ) F ta calc a R work / free= hkl rged d me F obs (hkl ) hkl I/σ (for unmerged or merged data!) CC1/2 and CCanom for the merged data 27

Decisions and compromises Which high-resolution cutoff for refinement? Higher resolution means better accuracy and maps But: high resolution yields high Rwork/Rfree! Which datasets/frames to include into scaling? Reject negative observations or unique reflections? The reason why it is difficult to answer R-value questions is that no proper mathematical theory exists that uses absolute differences; concerning the use of R-values, Crystallography is disconnected from mainstream Statistics 28

Improper crystallographic reasoning typical example: data to 2.0 Å resolution using all data: R work=19%, Rfree=24% (overall) cut at 2.2 Å resolution: R work=17%, Rfree=23% cutting at 2.2 Å is better because it gives lower R-values 31

Proper crystallographic reasoning 1. Better data allow to obtain a better model 2. A better model has a lower Rfree, and a lower Rfree-Rwork gap 3. Comparison of model R-values is only meaningful when using the same data 4. Taken together, this leads to the paired refinement technique : compare models in terms of their R-values against the same data. 32

Example: Cysteine DiOxygenase (CDO; PDB 3ELN) re-refined against 15-fold weaker data Rmerge Rpim Rfree Rwork I/sigma 33

Is there information beyond the conservative hi-res cutoff? Paired refinement technique : refine at (e.g.) 2.0Å and at 1.9Å using the same starting model and refinement parameters since it is meaningless to compare Rvalues at different resolutions, calculate the overall R-values of the 1.9Å model at 2.0Å (main.number_of_macro_cycles=1 strategy=none fix_rotamers=false ordered_solvent=false) ΔR=R1.9(2.0)-R2.0(2.0) 34

Measuring the precision of merged data with a correlation coefficient Correlation coefficient has clear meaning and well-known statistical properties Significance of its value can be assessed by Student's ttest (e.g. CC>0.3 is significant at p=0.01 for n>100; CC>0.08 is significant at p=0.01 for n>1000) Apply this idea to crystallographic intensity data: use random half-datasets CC1/2 (called CC_Imean by SCALA/aimless, now CC1/2 ) From CC1/2, we can analytically estimate CC of the merged dataset against the true (usually unmeasurable) intensities using 2 CC 1/ 2 CC*= 1+CC 1/2 (Karplus and Diederichs (2012) Science 336, 1030) 35

Data CCs CC1/2 CC* Δ I/sigma 36

Model CCs We can define CCwork, CCfree as CCs calculated on Fcalc2 of the working and free set, against the experimental data CCwork and CCfree can be directly compared with CC* CC* Dashes: CCwork, CCfree against weak exp tl data Dots: CC work, CC free against strong 3ELN data 37

Four new concepts for improving crystallographic procedures Image courtesy of 38 P.A. Karplus

Summary To predict suitability of data for downstream calculations (phasing, MR, refinement), we should use indicators of merged data precision Rmerge should no longer be considered as useful for deciding e.g. on a high-resolution cutoff, or on which datasets to merge, or how large total rotation I/σ has two drawbacks: programs do not agree on σ, and its value can only rise with multiplicity CC1/2 is well understood, reproducible, and directly links to model quality indicators 39

References P.A. Karplus and K. Diederichs (2012) Linking Crystallographic Data with Model Quality. Science 336, 1030-1033. see also: P.R. Evans (2012) Resolving Some Old Problems in Protein Crystallography. Science 336, 986-987. K. Diederichs and P.A. Karplus (2013) Better models by discarding data? Acta Cryst. D69, 1215-1222. P. R. Evans and G. N. Murshudov (2013) How good are my data and what is the resolution? Acta Cryst. D69, 1204-1214. Z. Luo, K. Rajashankar and Z. Dauter (2014) Weak data do not make a free lunch, only a cheap meal. Acta Cryst. D70, 253-260. J. Wang and R. A. Wing (2014) Diamonds in the rough: a strong case for the inclusion of weak-intensity X-ray diffraction data. Acta Cryst. D70, 1491-1497. Diederichs K, "Crystallographic data and model quality" in Nucleic Acids Crystallography. (Ed. E Ennifar), Methods in Molecular Biology (in press). 40

Thank you! PDF available pls send email to kay.diederichs@uni-konstanz.de 41