KNIME applications at Syngenta

Mark Earll, Senior Analytical and Data Scientist, Product Characterisation Group
Syngenta, Jealott's Hill International Research Centre

Contents
Introduction
- Syngenta
- Product Characterisation Department
Automated report writing with BIRT
Improved diagnostics for data QC and calibration using R nodes
KNIME for compilation of large datasets in Metabolomics and QSAR
Some cheminformatics applications

We bring plant potential to life
Syngenta is one of the world's leading companies, with more than 24,000 employees in over 90 countries dedicated to our purpose: Bringing plant potential to life. Our Crop Protection and Seeds products help growers increase crop yields and productivity. We contribute to meeting the growing global demand for food, feed and fuel and are committed to protecting the environment, promoting health and improving the quality of life.

Product Characterisation at Syngenta
Part of the Technology and Engineering group
Analytical chemistry function supporting formulation development, manufacturing and supply chain
Work includes
- Residual impurity analysis of production batches
- Support for formulation development greenhouse and field studies
- Identification of counterfeit products and grey imports
- Troubleshooting production issues
- Preparative separation to support ag-chem and regulatory
- Optical spectroscopy support for seeds manufacturing
Well-equipped labs with LC, GC, LCMS, GCMS, SFC, IR, UV etc.

Automation of reporting for analytical chemistry
Why automate?
- Speed
- Remove tedium
- Consistency
- Prevent re-typing (time wasting)
- Prevent transcription errors
- Gives audit trail for those who follow
- Diagnostic plots
- Access to better statistics than those provided in instruments
- Cope with a diversity of vendors
How
- Scripting (R, Python, Perl, JavaScript, JMP script)
- Workflow tool: KNIME

BIRT Report Designer in KNIME
Takes output from KNIME reporting nodes
Fairly intuitive to use
Similar to an HTML webpage authoring tool: drag and drop items from the workflow
Placeholders for available data and plots
Output in almost any format (PDF, Word, Excel ...)

Automation requires a change in thinking
Key to successful automation is CONSISTENCY of INPUT
Consistent names, headers, formats
Excel templates are useful
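
The point about consistent headers can be illustrated with a small check that runs before any processing. This is only a sketch: the column names in EXPECTED_HEADERS and the function check_headers are hypothetical and not taken from the Syngenta templates, and a CSV export stands in for the Excel pro-forma.

```python
import csv
import io

# Hypothetical column headers that an agreed template might define.
EXPECTED_HEADERS = ["Sample ID", "Analyte", "Peak Area", "Units"]


def check_headers(csv_text, expected=EXPECTED_HEADERS):
    """Return a list of problems found in the header row (empty if none)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    found = [h.strip() for h in header]
    problems = []
    for name in expected:
        if name not in found:
            problems.append(f"missing column: {name}")
    for name in found:
        if name not in expected:
            problems.append(f"unexpected column: {name}")
    return problems


# Illustrative inputs: one file follows the template, one does not.
good = "Sample ID,Analyte,Peak Area,Units\nS1,A,1234.5,mAU*s\n"
bad = "sample id,Analyte,Area,Units\nS1,A,1234.5,mAU*s\n"

print(check_headers(good))
print(check_headers(bad))
```

The comparison is deliberately strict (case-sensitive, exact names), since a workflow that silently tolerates variants tends to hide the inconsistency rather than remove it.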

Typical workflow
[Figure: workflow diagram with the labels MS conditions, HPLC conditions, Integration Reports, Excel pro-forma, Report, ELN]

Residual Impurity Analytical Report Workflow
Looks horrendously complex, but is not actually difficult to use (could be tidied up by using meta-nodes)
Pink highlighted area is the part you have to interact with
Process:
- Enter location of two Excel files and one text file
- Press Run
- Do a bit of QC inspection
- Circa <1 min runtime
- Compare with 2 hours for a manual write-up

The report

Example 2: Non-linear calibration
Analytical work in support of a microencapsulation formulation project
- 5 analytes determined with confidence interval
- Slightly non-linear regression
- Circa ½ day per report to write up manually, x 60 reports
- Potential saving of 30 person-days

Improved data diagnostics using R language
Using KNIME nodes as containers for R scripts gives the opportunity to make useful visual data validation tools
Using KNIME, the R scripts become modular and can be used by colleagues as building blocks in other workflows

An R node in more detail
Data from the workflow input becomes the R object knime.in, or individual variables may be selected
After manipulation in the node, data must be returned to the output as a new object, knime.out or knime.model
Or, if using an R View node, sent to an Image to Report node as a plot for reporting
Node can be unit tested and then used in other workflows
Can have other nodes in Perl, Python, Java etc. and mix and match; useful for collaboration

Linear regression tools for analytical calibration
Several R packages for inverse calibration
- chemCal by Johannes Ranke
- investr by Brandon Greenwell
- quantchem by Łukasz Komsta
I have written an R script using chemCal to give proper tools for assessment of
- Linearity
- Limit of Detection*
- Limit of Quantitation*
- Also gives R2, RMSE, Slope and Intercept
- Plus confidence intervals on predictions
* based upon ICH guidelines (3.3 or 10 x RMSE)
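
The quantities listed above can be sketched in a few lines. The following is a plain Python illustration, not the author's R script and not the chemCal API. It assumes that "RMSE" means the residual standard deviation of the fit and that the ICH-style limits are that value multiplied by 3.3 or 10 and divided by the calibration slope. The data are invented, and confidence intervals on predictions are left out.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    # Residual standard deviation with n - 2 degrees of freedom
    # (assumed here to be what the slide calls RMSE).
    rmse = math.sqrt(ss_res / (n - 2))
    return {"slope": slope, "intercept": intercept,
            "rmse": rmse, "r2": 1 - ss_res / ss_tot}


def predict_x(fit, response):
    """Inverse calibration: concentration estimated from a measured response."""
    return (response - fit["intercept"]) / fit["slope"]


# Invented calibration data: concentration against detector response.
conc = [1.0, 2.0, 4.0, 6.0, 8.0, 10.0]
resp = [2.6, 4.4, 8.7, 12.3, 16.6, 20.4]

cal = fit_line(conc, resp)
lod = 3.3 * cal["rmse"] / cal["slope"]   # limit of detection (assumed formula)
loq = 10.0 * cal["rmse"] / cal["slope"]  # limit of quantitation (assumed formula)

print(cal)
print("LOD:", lod, "LOQ:", loq)
print("Estimated concentration at response 10.0:", predict_x(cal, 10.0))
```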

Example 3: Data merging & fusion in Metabolomics
LC-MS Metabolomic data scripting
- Sorting
- Componentisation
- Internal standardisation
- Block Scaling
- Block combination
- x 6 datasets (typically 500 x 10,000)
2-3 days' work in Excel
Each KNIME workflow takes a few seconds
Invaluable for re-processing and correcting data
For simplicity the workflow was divided into manageable steps and verified at each stage...
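
Three of the steps above (internal standardisation, block scaling and block combination) can be sketched as follows. The definitions are assumptions made for illustration, since the slides do not give the exact formulas. Internal standardisation is taken as dividing each sample by its internal standard intensity, and block scaling as mean-centring followed by scaling each block to unit total variance so that no block dominates the fused table. All data and function names are hypothetical.

```python
import math


def internal_standardise(rows, istd_index):
    """Divide every intensity in a sample (row) by that sample's internal standard."""
    return [[v / row[istd_index] for v in row] for row in rows]


def block_scale(rows):
    """Mean-centre each column, then scale the whole block to unit total variance."""
    n = len(rows)
    n_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_cols)]
    centred = [[r[j] - means[j] for j in range(n_cols)] for r in rows]
    # Total variance of the block = sum of the column variances.
    total_var = sum(v ** 2 for r in centred for v in r) / (n - 1)
    scale = math.sqrt(total_var)
    return [[v / scale for v in r] for r in centred]


def combine_blocks(blocks):
    """Concatenate blocks column-wise; all blocks must share the same sample order."""
    fused = []
    for i in range(len(blocks[0])):
        row = []
        for block in blocks:
            row.extend(block[i])
        fused.append(row)
    return fused


# Invented data: 3 samples, two blocks; column 0 of block_a is the internal standard.
block_a = [[100.0, 50.0, 20.0], [200.0, 120.0, 30.0], [150.0, 60.0, 45.0]]
block_b = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0]]

scaled_a = block_scale(internal_standardise(block_a, 0))
scaled_b = block_scale(block_b)
fused = combine_blocks([scaled_a, scaled_b])
print(len(fused), "samples x", len(fused[0]), "variables")
```

Keeping each step as its own function mirrors the approach described on the slide of dividing the workflow into manageable steps that can be verified one at a time.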

Example 4: Compilation of QSAR data
As part of an IVCC (Innovative Vector Control Consortium) funded project, we have developed a multivariate QSAR model which has successfully predicted uptake of new insecticides into mosquitoes, assisting selection of lead candidates. http://www.ivcc.com/
Model relies on compilation of data from 5 molecular descriptor packages together with calculations from the KNIME CDK functions.
The problem with descriptor packages is that they tend to get updated, which means everything has to be recalculated and re-modelled
Use of KNIME makes this a much less painful task
Main modelling is done in SIMCA by PLS, but recently more parsimonious models are being developed using the LASSO and Random Forest R learner/predictor nodes in KNIME

Example 5: Predicting which chiral column to use
KNIME comes with a huge range of nodes, including cheminformatics tools
[Figure: example chemical structure, not a real structure]
Currently evaluating a probabilistic tool to suggest which chiral column is most likely to give a separation, based upon structural analogy with our in-house database of chiral separations and manufacturers' databases.
Test data success varies, but some columns reach as much as 75% success (IC column)
Prediction of normal or reverse phase (c. 90% accuracy on test data)
A chemical structure is entered, then KNIME calculates molecular fingerprints and physchem properties. These are fed to a Random Forest classification model and a report is sent back to the user with predictions.
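
The idea of "structural analogy" on fingerprints can be illustrated with a much simpler stand-in than the Random Forest model described above: a Tanimoto nearest-neighbour lookup. This sketch is not the tool being evaluated. The fingerprints are invented sets of "on" bit positions and the column names are placeholders.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    union = bits_a | bits_b
    if not union:
        return 0.0
    return len(bits_a & bits_b) / len(union)


def suggest_columns(query_bits, records):
    """Rank columns by the best similarity of the query to any compound
    previously separated on that column."""
    best = {}
    for bits, column in records:
        score = tanimoto(query_bits, bits)
        best[column] = max(best.get(column, 0.0), score)
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical database: (fingerprint bits, column that gave a separation).
database = [
    ({1, 2, 3, 8}, "Column A"),
    ({2, 3, 4}, "Column B"),
    ({10, 11, 12}, "Column C"),
]

query = {1, 2, 3}
for column, score in suggest_columns(query, database):
    print(column, round(score, 2))
```

A classifier such as a Random Forest can combine fingerprints with physchem properties and report class probabilities, which a similarity lookup like this cannot do; the sketch only shows how fingerprint overlap encodes analogy between structures.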

Example 6: Predicting retention time (QSPR)
Approximate prediction of retention times of plant metabolites was of interest to us, to aid with metabolite identification
HPLC retention times of plant metabolites are recorded against CDK-calculated molecular properties and fingerprints in KNIME.
Seven different machine learning regression methods from R were used
126 metabolites, 27 X variables, 1 Y variable (retention time)
Random 50:50 split into training and test sets.
PLS and PCR gave the most predictive models, due to the wide and correlated nature of QSPR data
Method            R2 of fit   R2 ext (Q2)
LASSO             0.706       0.434
Ridge             0.656       0.768
PLS               0.998       0.994
PCR               0.992       0.981
Random Forest     0.973       0.882
Support Vector    0.956       0.775
Neural Network    1.000       0.673
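
The two columns of the table, R2 of fit and external R2 (Q2), can be sketched as follows. This is an illustration only. It uses a single invented descriptor and a straight-line fit rather than the 27 variables and seven methods of the study, and it assumes one common definition of external Q2, in which the test-set errors are compared with deviations from the training-set mean.

```python
import random


def fit_line(x, y):
    """Least-squares slope and intercept for a single descriptor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(observed, predicted, reference_mean):
    """1 - (residual sum of squares / sum of squares about reference_mean)."""
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - reference_mean) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Synthetic data standing in for 126 metabolites (fixed seed for repeatability).
rng = random.Random(0)
x = [float(i) for i in range(126)]
y = [2.0 * xi + 1.0 + rng.gauss(0, 5) for xi in x]

# Random 50:50 split into training and test sets.
indices = list(range(126))
rng.shuffle(indices)
train_idx, test_idx = indices[:63], indices[63:]
x_train = [x[i] for i in train_idx]
y_train = [y[i] for i in train_idx]
x_test = [x[i] for i in test_idx]
y_test = [y[i] for i in test_idx]

slope, intercept = fit_line(x_train, y_train)
train_mean = sum(y_train) / len(y_train)

r2_fit = r_squared(y_train, [intercept + slope * v for v in x_train], train_mean)
q2_ext = r_squared(y_test, [intercept + slope * v for v in x_test], train_mean)
print("R2 of fit:", r2_fit)
print("R2 ext (Q2):", q2_ext)
```

A large gap between the two values, as for the Neural Network row in the table, is the usual sign of a model that fits the training set better than it predicts new compounds.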

Example 7: Multivariate Characterisation of Solvents and HPLC Phases
Principal component property maps help selection of alternate columns or solvents.
Alternatively, they can be used to maximise diversity in selections of items.
Data
- Solvents: Rolf Carlson data
- Columns: PQRI database (USP)
- Molecular properties: agro-chemical space
KNIME workflows
- R node with library(pcaMethods)
- 2D/3D scatter plot community contribution by Eli Lilly (Erlwood)
- Interactive 5-D plot with hover annotation (including structures)
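
One way a property map can be used to maximise diversity is a greedy "MaxMin" pick on the principal component scores: each new item is the one furthest from everything already chosen. The slides do not say which selection method was used, so this is an assumed technique, and the score coordinates and item names below are invented.

```python
import math


def maxmin_select(points, n_pick, start=0):
    """Greedy MaxMin selection: repeatedly add the point whose nearest
    already-picked neighbour is furthest away. Returns indices into points."""
    picked = [start]
    while len(picked) < min(n_pick, len(points)):
        best_idx, best_dist = None, -1.0
        for i, p in enumerate(points):
            if i in picked:
                continue
            nearest = min(math.dist(p, points[j]) for j in picked)
            if nearest > best_dist:
                best_idx, best_dist = i, nearest
        picked.append(best_idx)
    return picked


# Invented (PC1, PC2) scores for five hypothetical solvents.
names = ["Solvent A", "Solvent B", "Solvent C", "Solvent D", "Solvent E"]
scores = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.0, 5.0), (0.2, 0.1)]

chosen = maxmin_select(scores, n_pick=3)
print([names[i] for i in chosen])
```

The same distances can be read the other way round: the nearest neighbours of a given item on the map are the candidates for an alternative column or solvent with similar properties.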

Conclusions
Automation of analytical reporting by use of workflows improves consistency of results and removes tedium
KNIME is a very flexible data analysis and manipulation platform, with many advantages compared with copying and pasting spreadsheets
The combination of R and KNIME is very useful for containerising R scripts and making them available for re-use.
The ability to provide an audit trail of what you have done to the data is invaluable when troubleshooting and validating your work.
Data visualisation is essential to promote good quality control of the data
- Continuous improvement
- Good record keeping

Thanks and Acknowledgements
Syngenta
- Dave Portwood
- Mark Seymour
- Tom Salvesen
- Mark Forster
- David Lomath
- Pablo Navarro
- Thorsten Platz
KNIME
- Michael Berthold
- Thorsten Meinl
- Greg Landrum
R-Community
References:
- Eamonn Mullins, Statistics for the Quality Control Chemistry Laboratory, ISBN 978-0-85404-671-3
- R Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
- Michael R. Berthold et al., KNIME: The Konstanz Information Miner, Studies in Classification, Data Analysis, and Knowledge Organization, Springer, ISBN 978-3-540-78239-1, 2007
- Rolf Carlson et al., Screening of Suitable Solvents in Organic Synthesis. Strategies for Solvent Selection, Acta Chemica Scandinavica B 39 (1985) 79-91

For more details on R & KNIME
Open Source Software in Life Science Research: Practical Solutions to Common Challenges in the Pharmaceutical Industry and Beyond
Woodhead Publishing Series in Biomedicine No. 16
Edited by Lee Harland and Mark Forster
ISBN 1 907568 97 2, ISBN-13: 978 1 907568 97 8
Chapter 6: Open Source software for Mass Spectrometry and Metabolomics