Scientific Integrity: A crystallographic perspective
  Ian Bruno - Director, Strategic Partnerships
  The Cambridge Crystallographic Data Centre
  @ijbruno @ccdc_cambridge
  Scientific Integrity: Can We Rely on the Published Scientific Literature?
  250th ACS National Meeting & Exposition, August 16-20 2015, Boston, MA
  www.ccdc.cam.ac.uk
http://blogs.scientificamerican.com/absolutely-maybe/generation-open-sneak-peek-into-science-8217-s-future-at-opencon-2014/
http://researchdata.ox.ac.uk/2014/09/30/report-from-the-research-data-alliance-plenary-meeting-no-4/
"The research paper should be considered supplementary to the data" - Barend Mons (Leiden)
"Publications are not simply containers for data but rather arguments that are supported by data" - Christine Borgman (UCLA)
supplementary (ˌsʌp ləˈmɛn tə ri) 1. Forming or acting as a supplement.
supplement (sŭp lə-mənt) 1. Something added to complete a thing, make up for a deficiency, or extend or strengthen the whole.
http://www.thefreedictionary.com/supplementary
Ergo: A scientific article without supplementary data is incomplete, deficient or weak!
Can we rely on the published scientific literature?
Can we rely on the published scientific data?
Crystal Structure Databases
  Cambridge Structural Database
  organic and metal-organic compounds
  790,040 structures
  Figure: Growth of the CSD
Data Deposition and Access
  CIF file
  CCDC Structure Summary Page
  Many journals require derived data to be deposited with the CCDC prior to publication
  Data files are available to reviewers pre-publication and to everyone post-publication
  http://www.ccdc.cam.ac.uk/getstructures
Scientific Validation: checkcif
  http://checkcif.iucr.org/
  Checks consistency and integrity of the data
  Generates alerts indicating issues that should be corrected or explained
  Can be run interactively via a web form
  A checkcif API is now also available
Publisher Policies: checkcif
  Most publishers require checkcif to be run prior to submission
  Some require the report to be uploaded; others require it to be retained
  Some specifically request a PDF of the checkcif report
  Stringency varies depending on the journal
  Some require certain alerts to be justified
  checkcif reports go to the publisher, data files go to the CCDC
  Based on a review of Author Guidelines, June 2014
checkcif Validation Responses
  checkcif alerts and researcher responses can be embedded in CIFs
  Examples (CCDC 813412):
    Voids due to exclusion of unknown solvent
    Disorder in counter-ion
checkcif comments in the CCDC CIF Repository
  Look for data items beginning _vrf (Validation Response Form)
  Subset of around 480,000 deposited CIFs
  Around 8,000 CIFs contain validation responses (~1.5%)
  Indicates the number of CIFs where checkcif comments have been added at the point they are deposited with the CCDC. Not necessarily a reflection of how often checkcif is run.
  Frequently observed explanations for common alerts:
    Disorder
    Quality of sample
    Weak diffraction
    Limited beam time
    Water hydrogens hard to locate
    Modelling of solvent molecules
    Restraint strategy used to refine model
  Image: Twinned pyrite crystal, "Pyrite 60608" by Vassil. Licensed under Public Domain via Wikimedia Commons
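The search for _vrf data items described above could be sketched roughly as follows. This is a minimal illustration, not the method used by the CCDC: it assumes a validation response appears as a data name starting with _vrf_ at the beginning of a line, outside any semicolon-delimited text field. The sample CIF, the alert name _vrf_ALERT001_demo and the helper names are hypothetical.

```python
# Hypothetical, made-up CIF fragment used only for illustration.
SAMPLE_CIF = """data_example
_cell_length_a 10.0
_vrf_ALERT001_demo
;
PROBLEM: illustrative alert text
RESPONSE: illustrative explanation from the researcher
;
"""


def find_vrf_items(cif_text):
    """Return data names beginning with _vrf_ that occur outside text fields."""
    names = []
    in_text_field = False
    for line in cif_text.splitlines():
        # A semicolon in the first column opens or closes a multi-line text field.
        if line.startswith(";"):
            in_text_field = not in_text_field
            continue
        if in_text_field:
            continue
        stripped = line.strip()
        if stripped.lower().startswith("_vrf_"):
            names.append(stripped.split()[0])
    return names


def fraction_with_responses(cif_texts):
    """Fraction of CIFs (given as strings) containing at least one _vrf_ item."""
    if not cif_texts:
        return 0.0
    with_vrf = sum(1 for text in cif_texts if find_vrf_items(text))
    return with_vrf / len(cif_texts)


if __name__ == "__main__":
    print(find_vrf_items(SAMPLE_CIF))
    print(fraction_with_responses([SAMPLE_CIF, "data_other\n_cell_length_a 5.0\n"]))
```

A full CIF parser would be more robust than this line-based scan; the sketch only shows the idea of counting files that carry validation responses.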
Opportunities
  checkcif provides useful information about deposited datasets
    separate steps required to deposit and run checkcif
    often not obvious if checkcif has been run
    responses to checkcif alerts can be revealing but are largely hidden
  Can we
    make it easier for authors to satisfy journal requirements?
    make it easier for referees to access checkcif reports?
    remove uncertainty over whether checkcif has been used?
    make value added through responses more visible?
Recently Released
  http://www.ccdc.cam.ac.uk/deposit
  Uses new checkcif API
Level A: Most likely a serious problem - resolve or explain
Level B: A potentially serious problem, consider carefully
Level C: Check. Ensure it is not caused by an omission or oversight
Level G: General information/check it is not something unexpected
http://www.ccdc.cam.ac.uk/deposit
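As a rough illustration of how the four alert levels might be used to order a set of alerts for review, consider the sketch below. The level descriptions are taken from the slide; the representation of an alert as a (level, message) pair and the function name are assumptions made for this example and do not reflect the output format of checkcif or its API.

```python
# Level descriptions as given on the slide.
ALERT_LEVELS = {
    "A": "Most likely a serious problem - resolve or explain",
    "B": "A potentially serious problem, consider carefully",
    "C": "Check. Ensure it is not caused by an omission or oversight",
    "G": "General information/check it is not something unexpected",
}

# Most serious first; G is informational and sorts last.
SEVERITY_ORDER = "ABCG"


def triage(alerts):
    """Sort hypothetical (level, message) pairs from most to least serious.

    sorted() is stable, so alerts of the same level keep their original order.
    """
    return sorted(alerts, key=lambda alert: SEVERITY_ORDER.index(alert[0]))


if __name__ == "__main__":
    # Illustrative messages only, not real checkcif output.
    example = [("G", "informational note"), ("A", "serious issue"), ("C", "minor check")]
    for level, message in triage(example):
        print(f"Level {level}: {message} ({ALERT_LEVELS[level]})")
```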
Recently Released
  Responses are included in the CIF being deposited.
  http://www.ccdc.cam.ac.uk/deposit
  Download of checkcif reports to be added soon
  Later hope to enable reviewers to run checkcif if depositor did not
Possible CSD-based Checks for Small Molecules
  Could extend/complement checkcif with:
    geometry check
    void analysis
    interaction analysis
    commonality of space groups
  Could also feed back to the depositor about:
    other determinations of the same compound
    related structures (e.g. similarity search)
Can we rely on the published scientific literature?
Can we rely on the published scientific data?
Can we rely on knowledge-based analysis?
The CSD: Crystallography and Chemistry
  Provides understanding of molecular geometry and molecular interactions
  Enables structural knowledge to be applied to scientific problems
  Assignment of chemistry is required to make data findable, interoperable and reusable
Geometry Analysis
  ConQuest Search
    CSD 5.36: 3,435 hits
    Filters: None
    Mean angle: 121.9(18)°
  CSD 5.36: BEXYIO
    R-factor: 14.5%
    Angle: 109.29°
  Atomic Displacement Parameters (ADPs) indicate uncertainty in the position of an atom - typically represented as ellipsoids.
  Ellipsoids of significantly different sizes may reflect problems in the structure.
  ADPs available in deposited CIFs but not yet in the CSD.
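To show why the BEXYIO angle stands out, the numbers above can be turned into a simple z-score. This is only a sketch: it assumes that the "(18)" in 121.9(18) is a standard deviation of 1.8 degrees, and the threshold of 3 is a hypothetical cut-off chosen for illustration, not a value taken from ConQuest or Mogul.

```python
def z_score(observed, mean, std_dev):
    """Number of standard deviations by which an observed value differs from the mean."""
    return (observed - mean) / std_dev


def is_unusual(z, threshold=3.0):
    """Flag a value whose |z| exceeds a hypothetical threshold."""
    return abs(z) > threshold


if __name__ == "__main__":
    # Values from the slide: mean angle 121.9 degrees, assumed standard deviation 1.8 degrees.
    z = z_score(observed=109.29, mean=121.9, std_dev=1.8)
    print(f"BEXYIO angle z-score: {z:.2f}, unusual: {is_unusual(z)}")
```

Under these assumptions the BEXYIO angle lies about seven standard deviations below the mean of the search hits, consistent with the structure's high R-factor of 14.5%.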
Search Filters
  CSD 5.36: MAMSUQ
    R-factor: 4.96%
    Angle: 128.6°
  Figure panels: Without filters; With filters
Automated Geometry Analysis
  Mogul compares the geometry of a 3D molecule against the CSD
  Aims to strike a balance between being too general and too specific
  Volume and severity of alerts important in drawing conclusions
CSD-based Validation of Protein Ligands
  CSD-based geometry checks included in PDB validation pipeline
  "Bond lengths, bond angles, acyclic torsions and isolated rings are assessed by comparison with preferred molecular geometries derived from high-quality, small-molecule structures in the Cambridge Structural Database (CSD)."
  http://www.wwpdb.org/validation-reports.html
Can we rely on the published scientific literature?
Can we rely on the published scientific data?
Can we rely on knowledge-based analysis?
Can we rely on research data repositories?
Trusted Repositories
  Researchers will expect guidance on how to select an appropriate repository for their data
  Standards and guidelines for repositories exist and include
    the Data Seal of Approval
    the repository selection process for Thompson-Reuters (sic) Data Citation Index
    Digital Curation Centre
    Trusted Repositories Audit and Certification (TRAC) program
  Potential starting points for a community standards discussion
  http://dx.doi.org/10.1371/journal.pbio.1001975
Repository Certification
  Various stamps of approval available
    The Data Seal of Approval
    ICSU World Data System
    nestor Seal (derived from DIN 31644)
    ISO 16363 (based on TRAC)
  From light-weight to heavy-duty:
    DSA and WDS: self-certifying against ~16 criteria
    nestor Seal: 34 criteria
    ISO 16363: 70 pages, formal audit
Self Certification Criteria
  Criteria of DSA and WDS variously cover
    organizational framework - governance, sustainability
    data management - authenticity, integrity, accessibility
    technical infrastructure - support, security
  Evaluation procedures
    submissions are peer-reviewed
    emphasis on public documentation of procedures
  RDA DSA/WDS Audit and Certification Working Group
    explore and develop a DSA-WDS partnership with the objectives of realizing efficiencies, simplifying assessment options, stimulating more certifications, and increasing impact on the community
Global Initiatives in Research Data
  Bring together researchers in the domain of chemistry for a discussion about the formation of an RDA Interest Group (IG) on Chemical Data.
  251st ACS National Meeting & Exposition, San Diego CA, March 13-17, 2016
    Global initiatives in research data management and discovery
    How might the chemistry community best engage with and learn from broader activities in research data management and discovery?
Can we rely on the published scientific literature?
Can we rely on the published scientific data?
Can we rely on knowledge-based analysis?
Can we rely on research data repositories?
Concluding Thoughts
  checkcif greatly aids in the assessment of the quality of the data
    recent developments reduce barriers to running checkcif
    supplementary insights more likely to be visible in data files
  Knowledge-based analysis can provide additional insights
    potential for supplementing existing validation processes
    important to think about how alerts are used and presented
  The role of scientific data repositories is important
    deposition and access services that ensure supporting data is available
    enrichment of data to enable reuse in validation and other contexts
The Cambridge Crystallographic Data Centre
  International Data Repository
    Archive of crystal structure data
    High quality scientific database
  Scientific Software Provider
    Search/analysis/visualisation tools
    Scientific applications
  Collaborative Research Organisation
    New methodologies
    Fundamental research
  @ccdc_cambridge ccdc.cambridge http://www.ccdc.cam.ac.uk/
  Ian Bruno, Director, Strategic Partnerships, bruno@ccdc.cam.ac.uk
  Thanks to Mike Hoyland and others at the IUCr for the checkcif API, advice and support.