KNIME applications at Syngenta


Mark Earll, Senior Analytical and Data Scientist, Product Characterisation Group
Syngenta - Jealott's Hill International Research Centre

Contents
- Introduction
  - Syngenta
  - Product Characterisation Department
- Automated report writing with BIRT
- Improved diagnostics for data QC and calibration using R nodes
- KNIME for compilation of large datasets in Metabolomics and QSAR
- Some cheminformatics applications

We bring plant potential to life
Syngenta is one of the world's leading companies, with more than 24,000 employees in over 90 countries dedicated to our purpose: Bringing plant potential to life. Our Crop Protection and Seeds products help growers increase crop yields and productivity. We contribute to meeting the growing global demand for food, feed and fuel and are committed to protecting the environment, promoting health and improving the quality of life.

Product Characterisation at Syngenta
Part of the Technology and Engineering group.
Analytical chemistry function supporting formulation development, manufacturing and supply chain.
Work includes:
- Residual impurity analysis of production batches
- Support for formulation development greenhouse and field studies
- Identification of counterfeit products and grey imports
- Troubleshooting production issues
- Preparative separation to support ag-chem and regulatory
- Optical spectroscopy support for seeds manufacturing
Well-equipped labs with LC, GC, LC-MS, GC-MS, SFC, IR, UV, etc.

Automation of reporting for analytical chemistry
Why automate?
- Speed
- Remove tedium
- Consistency
- Prevent re-typing (time wasting)
- Prevent transcription errors
- Gives audit trail for those who follow
- Diagnostic plots
- Access to better statistics than those provided in instruments
- Cope with a diversity of vendors
How?
- Scripting (R, Python, Perl, JavaScript, JMP script)
- Workflow tool: KNIME

BIRT Report Designer in KNIME
- Takes output from KNIME reporting nodes
- Fairly intuitive to use
- Similar to an HTML webpage authoring tool: drag and drop items from the workflow
- Placeholders for available data and plots
- Output in almost any format (PDF, Word, Excel, etc.)

Automation requires a change in thinking
- Key to successful automation is CONSISTENCY of INPUT
- Consistent names, headers, formats
- Excel templates are useful
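One way to enforce consistent headers before a workflow runs is to validate the header row of each input file. The following is a minimal Python sketch, not the check used at Syngenta; the column names in EXPECTED_HEADERS and the CSV form of the template are assumptions made purely for illustration.

```python
import csv
import io

# Hypothetical column names for an input template; the real pro-forma
# used in the talk is not shown in the slides.
EXPECTED_HEADERS = ["SampleID", "Analyte", "PeakArea", "Concentration"]


def check_headers(csv_text, expected=EXPECTED_HEADERS):
    """Return a list of header problems; an empty list means the file conforms."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    found = [h.strip() for h in header]
    problems = []
    for name in expected:
        if name not in found:
            problems.append(f"missing column: {name}")
    for name in found:
        if name not in expected:
            problems.append(f"unexpected column: {name}")
    return problems


good = "SampleID,Analyte,PeakArea,Concentration\nS1,A,1200,0.5\n"
bad = "Sample ID,Analyte,Peak Area,Concentration\nS1,A,1200,0.5\n"

print(check_headers(good))
print(check_headers(bad))
```

Rejecting a file with "Sample ID" instead of "SampleID" at this stage is cheaper than debugging a failed join further down the workflow.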

Typical workflow
(Diagram. Labels: MS conditions, Integration Reports, Excel pro-forma, HPLC conditions, Report, ELN)

Residual Impurity Analytical Report Workflow
Looks horrendously complex, but is not actually difficult to use (it could be tidied up by using meta-nodes).
The pink highlighted area is the part you have to interact with.
Process:
- Enter location of 2 Excel files and one text file
- Press Run
- Do a bit of QC inspection
- Runtime is under 1 minute
- Compare with 2 hours for a manual write-up

The report

Example 2: Non-linear calibration
Analytical work in support of a microencapsulation formulation project
- 5 analytes determined with confidence interval
- Slightly non-linear regression
- Circa ½ day per report to write up manually, x 60 reports
- Potential saving of 30 person-days

Improved data diagnostics using the R language
- Using KNIME nodes as containers for R scripts gives the opportunity to make useful visual data validation tools.
- Within KNIME the R scripts become modular and can be used by colleagues as building blocks in other workflows.

An R node in more detail
- Data from the workflow input becomes the R object knime.in, or individual variables may be selected.
- After manipulation in the node, data must be returned to the output as a new object, knime.out or knime.model.
- If using an R view node, the output is sent to an Image To Report node as a plot for reporting.
- A node can be unit tested and then used in other workflows.
- Other nodes can be written in Perl, Python, Java, etc. and mixed and matched, which is useful for collaboration.

Linear regression tools for analytical calibration
Several R packages for inverse calibration:
- chemCal by Johannes Ranke
- investr by Brandon Greenwell
- quantchem by Łukasz Komsta
I have written an R script using chemCal to give proper tools for assessment of:
- Linearity
- Limit of Detection*
- Limit of Quantitation*
- Also gives R², RMSE, slope and intercept
- Plus confidence intervals on predictions
* Based upon ICH guidelines (3.3 or 10 x RMSE)
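The quantities listed above can be sketched in a few lines of Python. This is not the chemCal-based R script described in the talk; it is a plain least-squares illustration with made-up data. Two assumptions are made here: RMSE is taken as the residual standard deviation with n - 2 degrees of freedom, and the 3.3 and 10 multipliers are divided by the slope so that the limits come out in concentration units, as in the ICH formula (3.3 σ / S and 10 σ / S). Confidence intervals on predictions are not covered.

```python
import math


def fit_calibration(conc, resp):
    """Least-squares fit of response = intercept + slope * concentration."""
    n = len(conc)
    mean_x = sum(conc) / n
    mean_y = sum(resp) / n
    sxx = sum((x - mean_x) ** 2 for x in conc)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(conc, resp))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(conc, resp))
    ss_tot = sum((y - mean_y) ** 2 for y in resp)
    # Assumption: RMSE is the residual standard deviation (n - 2 d.o.f.)
    rmse = math.sqrt(ss_res / (n - 2))
    return {
        "slope": slope,
        "intercept": intercept,
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": rmse,
        "lod": 3.3 * rmse / slope,   # limit of detection, concentration units
        "loq": 10.0 * rmse / slope,  # limit of quantitation, concentration units
    }


def predict_concentration(fit, response):
    """Inverse calibration: read a concentration off the fitted line."""
    return (response - fit["intercept"]) / fit["slope"]


# Made-up calibration standards, for illustration only
conc = [1.0, 2.0, 3.0, 4.0, 5.0]
resp = [2.1, 3.9, 6.2, 7.8, 10.0]

fit = fit_calibration(conc, resp)
for key, value in fit.items():
    print(f"{key}: {value:.4f}")
print("concentration at response 5.0:", round(predict_concentration(fit, 5.0), 4))
```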

Example 3: Data merging & fusion in Metabolomics
LC-MS metabolomic data scripting:
- Sorting
- Componentisation
- Internal standardisation
- Block scaling
- Block combination
- x 6 datasets (typically 500 x 10,000)
2-3 days' work in Excel; each KNIME workflow takes a few seconds.
Invaluable for re-processing and correcting data.
For simplicity the workflow was divided into manageable steps and verified at each stage.
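Three of the steps above (internal standardisation, block scaling and block combination) can be illustrated with a short Python sketch. It is not the KNIME workflow from the talk, and the data are invented. The block-scaling convention used here is an assumption: each variable is autoscaled and then weighted by 1/sqrt(K), where K is the number of variables in the block, so that every block contributes the same total variance to the combined table.

```python
import math


def internal_standardise(rows, is_index):
    """Divide every intensity in a sample (row) by that sample's internal standard."""
    return [[v / row[is_index] for v in row] for row in rows]


def autoscale_columns(rows):
    """Centre each column to mean 0 and scale it to unit sample standard deviation."""
    n = len(rows)
    scaled_cols = []
    for col in zip(*rows):
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        scaled_cols.append([(v - mean) / sd if sd > 0 else 0.0 for v in col])
    return [list(r) for r in zip(*scaled_cols)]


def block_scale(rows):
    """Autoscale, then weight by 1/sqrt(K) so the block's total variance is 1."""
    weight = 1.0 / math.sqrt(len(rows[0]))
    return [[v * weight for v in row] for row in autoscale_columns(rows)]


def combine_blocks(blocks):
    """Join blocks column-wise; all blocks must list the same samples in the same order."""
    return [sum((list(b[i]) for b in blocks), []) for i in range(len(blocks[0]))]


# Invented data: 3 samples; last column of `raw` is the internal standard
raw = [[100.0, 50.0, 10.0], [220.0, 90.0, 20.0], [150.0, 80.0, 10.0]]
ratios = internal_standardise(raw, is_index=2)
block_a = [row[:2] for row in ratios]  # drop the now-constant internal standard column
block_b = [[1.0, 5.0, 2.0, 7.0], [2.0, 3.0, 4.0, 1.0], [4.0, 8.0, 3.0, 2.0]]

fused = combine_blocks([block_scale(block_a), block_scale(block_b)])
for row in fused:
    print([round(v, 3) for v in row])
```

Without the block weight, the four-variable block would carry twice the variance of the two-variable block in any subsequent multivariate model.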

Example 4: Compilation of QSAR data
As part of an IVCC (Innovative Vector Control Consortium) funded project we have developed a multivariate QSAR model which has successfully predicted uptake of new insecticides into mosquitoes, and which has assisted selection of lead candidates. http://www.ivcc.com/
- The model relies on compilation of data from 5 molecular descriptor packages, together with calculations from the KNIME CDK functions.
- The problem with descriptor packages is that they tend to get updated, which means everything has to be recalculated and re-modelled.
- Use of KNIME makes this a much less painful task.
- The main modelling is done in SIMCA by PLS, but recently more parsimonious models are being developed using the LASSO and Random Forest R learner/predictor nodes in KNIME.

Example 5: Predicting which chiral column to use
KNIME comes with a huge range of nodes, including cheminformatics tools.
(Figure: example structure; not a real structure)
- Currently evaluating a probabilistic tool to suggest which chiral column is most likely to give a separation, based upon structural analogy with our in-house database of chiral separations and manufacturers' databases.
- Test data success varies, but some columns reach as much as 75% success (IC column).
- Prediction of normal or reverse phase (c. 90% accuracy on test data).
- A chemical structure is entered, then KNIME calculates molecular fingerprints and physchem properties. These are fed to a Random Forest classification model and a report is sent back to the user with predictions.

Example 6: Predicting retention time (QSPR)
Approximate prediction of the retention times of plant metabolites was of interest to us as an aid to metabolite identification.
- HPLC retention times of plant metabolites are recorded against CDK-calculated molecular properties and fingerprints in KNIME.
- Seven different machine learning regression methods from R were used.
- 126 metabolites, 27 X variables, 1 Y variable (retention time).
- Random split 50:50 into test and training sets.
- PLS and PCR gave the most predictive models, due to the wide and correlated nature of QSPR data.

Method            R² of fit   R² ext (Q²)
LASSO             0.706       0.434
Ridge             0.656       0.768
PLS               0.998       0.994
PCR               0.992       0.981
Random Forest     0.973       0.882
Support Vector    0.956       0.775
Neural Network    1.000       0.673
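The difference between R² of fit and external R² (Q²) reported above can be illustrated with a small Python sketch. It uses synthetic data and a single-descriptor straight-line model as a stand-in; it does not reproduce the PLS, PCR or other R models from the talk. Q² is computed here against the training-set mean, which is one common convention among several and is an assumption of this sketch.

```python
import random


def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(observed, predicted, reference_mean):
    """1 - SS_residual / SS_total, with SS_total taken about reference_mean."""
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - reference_mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in for 126 compounds: one descriptor x, one response y
rng = random.Random(0)
x = [rng.uniform(0.0, 10.0) for _ in range(126)]
y = [2.0 + 1.5 * xi + rng.gauss(0.0, 1.0) for xi in x]

# Random 50:50 split into training and test sets
idx = list(range(126))
rng.shuffle(idx)
train, test = idx[:63], idx[63:]
x_tr, y_tr = [x[i] for i in train], [y[i] for i in train]
x_te, y_te = [x[i] for i in test], [y[i] for i in test]

slope, intercept = fit_line(x_tr, y_tr)
mean_tr = sum(y_tr) / len(y_tr)

# R2 of fit: how well the model reproduces the data it was trained on
r2_fit = r_squared(y_tr, [intercept + slope * v for v in x_tr], mean_tr)
# Q2 (external): how well it predicts compounds it has never seen
q2_ext = r_squared(y_te, [intercept + slope * v for v in x_te], mean_tr)

print(f"R2 of fit: {r2_fit:.3f}")
print(f"Q2 ext:    {q2_ext:.3f}")
```

A model whose R² of fit is close to 1 but whose Q² is much lower is fitting noise in the training set, which is why the external column is the one to compare methods on.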

Example 7: Multivariate Characterisation of Solvents and HPLC Phases
Principal component property maps help selection of alternate columns or solvents. Alternatively, they can be used to maximise diversity in selections of items.
Data:
- Solvents: Rolf Carlson data
- Columns: PQRI database, USP
- Molecular properties: agrochemical space
KNIME workflows:
- R node with library(pcaMethods)
- 2D/3D scatter plot community contribution by Eli Lilly (Erlwood)
- Interactive 5-D plot with hover annotation (including structures)

Conclusions
- Automation of analytical reporting by use of workflows improves consistency of results and removes tedium.
- KNIME is a very flexible data analysis and manipulation platform. It has many advantages compared with copying and pasting spreadsheets.
- The combination of R and KNIME is very useful for containerising R scripts and making them available for re-use.
- The ability to provide an audit trail of what you have done to the data is invaluable when troubleshooting and validating your work.
- Data visualisation is essential to promote good quality control of the data.
  - Continuous improvement
  - Good record keeping

Thanks and Acknowledgements
Syngenta:
- Dave Portwood
- Mark Seymour
- Tom Salvesen
- Mark Forster
- David Lomath
- Pablo Navarro
- Thorsten Platz
KNIME:
- Michael Berthold
- Thorsten Meinl
- Greg Landrum
R-Community
References:
- Eamonn Mullins, Statistics for the Quality Control Chemistry Laboratory. ISBN 978-0-85404-671-3.
- R Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
- Michael R. Berthold et al. (2007). KNIME: The Konstanz Information Miner. Studies in Classification, Data Analysis, and Knowledge Organization, Springer. ISBN 978-3-540-78239-1.
- Rolf Carlson et al. (1985). Screening of Suitable Solvents in Organic Synthesis. Strategies for Solvent Selection. Acta Chemica Scandinavica B 39, 79-91.

For more details on R & KNIME
Open Source Software in Life Science Research: Practical Solutions to Common Challenges in the Pharmaceutical Industry and Beyond. Edited by Lee Harland and Mark Forster. Woodhead Publishing Series in Biomedicine No. 16. ISBN 1 907568 97 2, ISBN-13: 978 1 907568 97 8.
Chapter 6: Open Source software for Mass Spectrometry and Metabolomics
