Chemometrics. Classification of Mycobacteria by HPLC and Pattern Recognition. Application Note. Abstract

Similar documents
PAT/QbD: Getting direction from other industries

MULTIVARIATE PATTERN RECOGNITION FOR CHEMOMETRICS. Richard Brereton

profileanalysis Innovation with Integrity Quickly pinpointing and identifying potential biomarkers in Proteomics and Metabolomics research

Discrimination of Dyed Cotton Fibers Based on UVvisible Microspectrophotometry and Multivariate Statistical Analysis

Chemometrics: Classification of spectra

Maximizing Triple Quadrupole Mass Spectrometry Productivity with the Agilent StreamSelect LC/MS System

Application of mathematical, statistical, graphical or symbolic methods to maximize chemical information.

Chemometrics. 1. Find an important subset of the original variables.

Chemometrics. Chemometrics in Chromatography. Application Overview. Abstract

High-Throughput Protein Quantitation Using Multiple Reaction Monitoring

Application Note # AN112 Failure analysis of packaging materials

Deconvolution of Overlapping HPLC Aromatic Hydrocarbons Peaks Using Independent Component Analysis ICA

AUTOMATED TEMPLATE MATCHING METHOD FOR NMIS AT THE Y-12 NATIONAL SECURITY COMPLEX

STUDY ON METHODS FOR COMPUTER-AIDED TOOTH SHADE DETERMINATION

AppNote 7/2002. Classifi cation of Food and Flavor Samples using a Chemical Sensor KEYWORDS ABSTRACT

Optimizing Model Development and Validation Procedures of Partial Least Squares for Spectral Based Prediction of Soil Properties

ECE 592 Topics in Data Science

IR Biotyper. Innovation with Integrity. Microbial typing for real-time epidemiology FT-IR

SEAMLESS INTEGRATION OF MASS DETECTION INTO THE UV CHROMATOGRAPHIC WORKFLOW

Anomaly Detection. Jing Gao. SUNY Buffalo

Discriminant analysis and supervised classification

Table of Contents. Multivariate methods. Introduction II. Introduction I

Multivariate Analysis of Ecological Data using CANOCO

Introduction to Chromatographic Separations

WADA Technical Document TD2003IDCR

Introduction. Chapter 1. Learning Objectives

Linear Models for Classification

Application Note 12: Fully Automated Compound Screening and Verification Using Spinsolve and MestReNova

Drift Reduction For Metal-Oxide Sensor Arrays Using Canonical Correlation Regression And Partial Least Squares

INDEPENDENT COMPONENT ANALYSIS (ICA) IN THE DECONVOLUTION OF OVERLAPPING HPLC AROMATIC PEAKS OF OIL

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Automated and accurate component detection using reference mass spectra

Application of Raman Spectroscopy for Detection of Aflatoxins and Fumonisins in Ground Maize Samples

AppNote 2/2002. Discrimination of Soft Drinks using a Chemical Sensor and Principal Component Analysis KEYWORDS ABSTRACT

A Modular NMF Matching Algorithm for Radiation Spectra

Basics of Multivariate Modelling and Data Analysis

C HA R AC T E RIZ AT IO N O F INK J E T P RINT E R C A RT RIDG E INK S USING A CHEMOMETRIC APPROACH

Introduction to Supervised Learning. Performance Evaluation

Natural Products. Innovation with Integrity. High Performance NMR Solutions for Analysis NMR

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity.

New Dynamic MRM Mode Improves Data Quality and Triple Quad Quantification in Complex Analyses

Quantification of growth promoters olaquindox and carbadox in animal feedstuff with the Agilent 1260 Infinity Binary LC system with UV detection

Multivariate Analysis Cluster Analysis

Identifying Disinfection Byproducts in Treated Water

FINDING DESCRIPTORS USEFUL FOR DATA MINING IN THE CHARACTERIZATION DATA OF CATALYSTS

Course in Data Science

British American Tobacco Group Research & Development. Method - Determination of ammonia in mainstream smoke

Effective Linear Discriminant Analysis for High Dimensional, Low Sample Size Data

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Verification of Pharmaceutical Raw Materials Using the Spectrum Two N FT-NIR Spectrometer

Introduction to Principal Component Analysis (PCA)

Robot Image Credit: Viktoriya Sukhanova 123RF.com. Dimensionality Reduction

The Analysis of Organophosphate Pesticides by LC/MS Application

Characterization and Classification of Heroin from Illicit Drug Seizures Using the Agilent 7200 GC/Q-TOF

Petrofuels Quality Control and Origin Determination

Peptide Isolation Using the Prep 150 LC System

Computational paradigms for the measurement signals processing. Metodologies for the development of classification algorithms.

Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size

Gas Chromatography. A schematic diagram of a gas chromatograph

Method Transfer between HPLC and UHPLC Instruments Equipment-related challenges and solutions

Cerno Application Note Extending the Limits of Mass Spectrometry

CHAPTER 1 Role of Bioanalytical Methods in Drug Discovery and Development

Multivariate calibration

Agilent 6400 Series Triple Quadrupole Users Workshop. MH QQQ Users workshop 2/21/2014 1

Positive and Nondestructive Identification of Acrylic-Based Coatings

Semi-Quantitative Analysis of Analytical Data using Chemometric Methods. Part II.

Structural Analysis by In-Depth Impurity Search Using MetID Solution and High Accuracy MS/MS

Title Experiment 7: Gas Chromatography and Mass Spectrometry: Fuel Analysis

A Strategy for an Unknown Screening Approach on Environmental Samples Using HRAM Mass Spectrometry

Machine Learning 2nd Edition

CoDa-dendrogram: A new exploratory tool. 2 Dept. Informàtica i Matemàtica Aplicada, Universitat de Girona, Spain;

Macrolides in Honey Using Agilent Bond Elut Plexa SPE, Poroshell 120, and LC/MS/MS

Microbiology / Active Lecture Questions Chapter 10 Classification of Microorganisms 1 Chapter 10 Classification of Microorganisms

Observing Dark Worlds (Final Report)

Overview. Introduction. André Schreiber 1 and Yun Yun Zou 1 1 AB SCIEX, Concord, Ontario, Canada

MultiscaleMaterialsDesignUsingInformatics. S. R. Kalidindi, A. Agrawal, A. Choudhary, V. Sundararaghavan AFOSR-FA

Sensitivity of the Elucidative Fusion System to the Choice of the Underlying Similarity Metric

2 D wavelet analysis , 487

Detecting Dark Matter Halos using Principal Component Analysis

A Tiered Screen Protocol for the Discovery of Structurally Diverse HIV Integrase Inhibitors

Agilent MassHunter Profinder: Solving the Challenge of Isotopologue Extraction for Qualitative Flux Analysis

Dimensionality Reduction and Principal Components

The Power of LC MALDI: Identification of Proteins by LC MALDI MS/MS Using the Applied Biosystems 4700 Proteomics Analyzer with TOF/TOF Optics

9 Classification. 9.1 Linear Classifiers

University of Florida CISE department Gator Engineering. Clustering Part 1

Evaluating Urban Vegetation Cover Using LiDAR and High Resolution Imagery

Introduction to multivariate analysis Outline

ON THE CALCULATION OF A ROBUST S-ESTIMATOR OF A COVARIANCE MATRIX

Morphologically directed chemical identification Coupling particle characterization by morphological imaging with Raman spectroscopy

Using radial basis elements for classification of food products by spectrophotometric data

Classification for High Dimensional Problems Using Bayesian Neural Networks and Dirichlet Diffusion Trees

Next Generation Computational Chemistry Tools to Predict Toxicity of CWAs

Time Series Classification

CHEMISTRY. CHEM 0100 PREPARATION FOR GENERAL CHEMISTRY 3 cr. CHEM 0110 GENERAL CHEMISTRY 1 4 cr. CHEM 0120 GENERAL CHEMISTRY 2 4 cr.

AppNote 5/2002. Characterization of a Mass Spectrometer based Electronic Nose for Routine Quality Control Measurements of Flavors KEYWORDS ABSTRACT

The Coulter Principle for Outlier Detection in Highly Concentrated Solutions

Three-Dimensional Electron Microscopy of Macromolecular Assemblies

Bruker Daltonics. EASY-nLC. Tailored HPLC for nano-lc-ms Proteomics. Nano-HPLC. think forward

Key Words Q Exactive, Accela, MetQuest, Mass Frontier, Drug Discovery

Discover what you re missing

Transcription:

12-1214 Chemometrics Application Note Classification of Mycobacteria by HPLC and Pattern Recognition Abstract Mycobacteria include a number of respiratory and non-respiratory pathogens for humans, such as M. tuberculosis, the causative agent of the disease for which it is named. Identification of the responsible bacterium is, therefore, a critical first step in public health regulation or medical treatment. Traditional methods have relied on classification based on morphology and enzymatic tests, which are subjective and can be time-consuming (typical turnaround from sample collection to results is measured in weeks). Recently, it has been demonstrated that the fatty acids of the cell walls of these bacteria are diagnostic for some species (1). These fatty acids are mainly comprised of high molecular weight (C 70 -C 90 ) α-branched, β-hydroxy mycolic acids, which, due to their involatility, are not as amenable to gas chromatographic procedures. However, liquid chromatography of these mycolic acids can be utilized (2), and such an approach can provide a more rapid and reproducible method for the identification of target species of Mycobacteria. In this note, chromatographic results from the HPLC method are analyzed with chemometric methods. Principal Components Analysis is shown to aid in visual classification and determination of the presence of outlying or aberrant samples. Predictive models were built for use in rapid, routine classification using the K-Nearest Neighbors approach and SIMCA, the latter based on principal component models of individual classes. Infometrix, Inc, 11807 N Creek Pkwy S, Suite B-111, Bothell, WA 98011 Phone: (425)-402-1450, email: info@infometrix.com

Experimental All laboratory work was performed at the Centers for Disease Control, in Atlanta, GA, where an extensive project is underway to perfect the classification system. The goal of the CDC work is to optimize the speed of analysis without sacrificing accuracy of classification. Prior to chromatographic analysis, samples were derivatized to facilitate their detection. Samples were first prepared by basic saponification, then an acidic extraction, using CHCl 3, isolated the mycolic acids. Derivatization followed, converting the analytes to their p- bromophenacyl esters. A Beckman System Gold HPLC instrument was used to analyze the resulting esters, using a 3 µm C 18, 4.6 mm x 7 cm cartridge column to separate the analytes. The mobile phase was applied in a gradient, running from 80% CH 3 OH in CH 2 Cl 2, to 35%, during 10 minutes, with detection at 260 nm. Of the peaks which eluted, 22 were integrated and stored in a results file for later multivariate analysis. Figure 1 is representative of the results of the chromatographic method, showing some of these peaks. The Infometrix Pirouette software was used to perform the pattern recognition and classifications. Mycolic acids Figure 1. Representative chromatogram showing distribution of mycolic acids Results and Discussion In all, some 50 different Mycobacteria species are under study. The mycolic acids which elute under this chromatographic regime occur throughout the 10 minute gradient window. However, main peak clusters can be observed for the majority of species, with two significant clusters occurring from 5 to 7 minutes and from 7.5 to 9 minutes. Those species which exhibit significant peaks in only the latter cluster are referred to as the Single Cluster species and are the focus of this note. Two data sets containing only single cluster samples were combined to determine the feasibility of a pattern recognition approach to classification. Represented in these combined data were examples of the eight species listed below, for a total of 188 samples analyzed. M. asiaticum M. bovis, BCG M. gastri M. gordonae M. marinum M. tuberculosis Exploratory Data Analysis Two multivariate methods, Hierarchical Cluster Analysis and Principal Components Analysis, were utilized to provide an overview of the distinguishability of species, based on the chromatographic data. In the clustering method, samples are intercompared based on their multivariate distances in a 22-coordinate space (each coordinate axis corresponds to one of the variables or chromatographic peaks). Following PCA, the sample points in 22-space are transformed to a new data space: the new coordinates, the principal component axes, are determined as those which contain the most variance (information) in the data set. Axes which represent mostly noise can be ignored, resulting in a reduction in the dimensionality of the data. These transformed data projections, referred to as scores (see Figure 2), showed that two species formed into distinct clusters; another two species could be distinguished clearly. Samples of the remaining four species were somewhat overlapped. 2014 Page 2

M. asiaticum M. tuberculosis Figure 2. PCA scores of Mycobacteria samples The results from HCA can best be viewed in a dendrogram, as that in Figure 3. Here, we can see that good separation is possible, and in most sub-branches, the species present were homogeneous. In fact, only 2 samples appeared in sub-branches to which they did not clearly belong. M. tuberculosis M. bovis, BCG M. marinum to develop a reliable training set from which predictions of the classes of unknown samples can be accomplished. For this note, an evaluation was first done with the K-Nearest Neighbors method. In this method, the prediction of the class of an unknown or test sample is obtained by observing, in the 22- space defined above, the class of its nearest neighbors in the training set. The first goal of KNN was to determine if outliers were present in training set. One of the two combined sets was designated as a training set (84 samples), with the other as a test set (104 samples). No significant outliers were found, therefore, KNN was repeated to predict the class of the second set. Because in this set the actual species were already known, it was used to validate the model developed during the training step. Six samples were incorrectly classified: samples of were classified as. If many samples of a single species are misclassified, one should not necessarily assume that they are bad samples. Such a situation merits closer inspection to understand the cause. In Figure 4, the misclassified samples were highlighted so that they might better stand out in M. asiaticum M. marinum M. gastri & kansasii & kansasii M. gastri M. bovis, BCG M. gordonae misclassified Figure 3. HCA dendrogram for Mycobacteria samples Classification Analysis A general approach for pattern recognition is Figure 4. PCA scores plot highlighting six misclassified samples 2014 Page 3

the scores plot. This 3D view of scores was rotated to best view the misclassified samples with respect to the other neighboring samples. From this view, it is clear why these samples were missed-there is a subset of samples that are definitely closer to the misclassified samples than are the remaining M. kansasii samples. To probe deeper, another data subset was created which contained only samples of these two overlapped species, and the exploratory algorithms were run once more. In the dendrogram (Figure 5), the data points appear to cluster in 5 groups: two of and three of. Questions arise whether these two species are truly homogeneous. But other considerations include whether there are simply enough samples to adequately characterize the intra-species variation, especially if the samples are found to be properly identified by independent means. In the scores plot of these two species (Figure 6), the subclustering is clearly evident. However, by careful manipulation of the 3D view, a perspective can be achieved in which the two species might be separated by a linear discriminant. Such a view is given in Figure 7. Figure 6. PC scores of and showing subclusters of each species This implies that including specimens from each of the three subclusters in a training set might improve the prediction quality. Thus, a new training set was created from among all of the species, but now including representatives from each of the subsets of M. kansasii and samples. This new training set of 65 samples was evaluated with KNN by predicting the species of all of the remaining samples. In this instance, 100% of the samples were correctly assessed. Figure 5. Dendrogram showing subclustering of M. kansasii and samples 2014 Page 4

M. Goodfellow, and M. Magnusson. Arch. Microbiol. (1984) 139:225-231. 2. Butler, W. Ray and J. O. Kilburn. J. Clin. Microbiol. (1988) 26: 50-53. Figure 7. Discriminant view separating and Conclusions From the results derived with this preliminary study, we can conclude the following: Pattern recognition can be used on HPLC data to rapidly classify Mycobacteria to species and perhaps to strain within a species. The algorithmic classification process can be automated by implementation into an instrumental method, removing operator dependence and therefore the subjectivity. Information can be derived which is not obvious from the traditional approaches or even from a casual examination of the chromatographic profiles (as in the distinction of subclusters within and ). The and mix-up also points out the importance of having a representative set of samples in the training set prior to predicting unknown species. References 1. Minnikin, D.E., S. M. Minnikin, J.H. Parlett, 2014 Page 5