Semi-Quantitative Analysis of Analytical Data using Chemometric Methods. Part II.


Simon Bates, Ph.D.

After working through the various identification and matching methods, we are finally at the point where we can directly address semi-quantitative methods without the use of standards. Previously, it was discussed why the use of standard calibration and validation samples is often the primary cause of real error in quantitative methods. This is a well-known problem when performing any type of quantitative analysis on solid-state samples. A number of chemometric methods have been developed specifically to deal with quantitative analysis without having to use standards.

One of the first approaches developed was fundamental parameters. This approach is not really chemometrics, in that it is based upon detailed theoretical knowledge of how a sample, and each individual phase within it, responds to the analytical method being used. The key to developing a fundamental parameters (FP) method is the availability of a theoretical framework that can calculate the relative response of each phase being quantified to the specific analytical technique being used. For example, the Rietveld method is considered to be an FP method for XRPD. Given just the single-crystal structures of each phase, a Rietveld program can calculate (predict) what a measured powder pattern will look like for any phase composition. Rietveld programs can also model matrix effects as localized micro-absorption anomalies, although the model used is in general overly simple. As discussed previously when matrix effects were first modeled, a drug product (tablet) is close to an ideal powder sample and in most cases is expected to have a very low level of matrix problems, making it ideal for FP methods. Although FP methods are well established for X-ray based techniques, they are less common for other analytical methods.
SSNMR models do exist but are rarely used for semi-quant; to date I have not seen a working Raman method, although there is no reason why one could not be developed.

FP methods are, by definition, specific to a single analytical technique. The semi-quant chemometric methods, on the other hand, are applicable to a wide array of different types of analytical data. The bulk of the chemometric methods follow a very similar path to arrive at the semi-quant numbers. Some of the more common methods are called Mixture Self Analysis, Pure Curve Resolution, and Pure Factor Analysis, and they incorporate techniques such as Alternating Least Squares (ALS). Given an ensemble of data, these techniques have two goals: the first is to identify the pure components of the data, and the second is to return semi-quant numbers for each data set with respect to the individual pure components. There is no reliance on any data outside of the initial data ensemble. For XRPD, the pure components would be reference powder patterns for each pure phase in the sample mixture. (An excipient contribution is treated as a single pure phase, since in principle its powder pattern is unchanging from one drug product sample to the next even though the excipient blend contains many different phases.) So one component of the method is to find the smallest number of pure components that can describe the large majority of the measured data files; these will represent the pure phase reference patterns. In the second component, the pure phase reference patterns are compared with each measured data set to derive the optimum phase composition of each sample. The estimation of the pure curves and the semi-quant determination alternate: the method hops between determining the phase composition and extracting the pure curves, which is why the fitting component is often called ALS, Alternating Least Squares. In the brute-force methods, the initial starting pure curves are simply straight lines, and the least-squares fitting is relied upon to distort each initial straight line towards the powder pattern of a pure phase.
To realistically use the brute-force methods requires that the data files be binned and compressed to minimize the number of data points without losing peak information. Typically, a powder pattern with a step size of 0.02 degrees 2Theta can be binned by a factor of two, three times, giving a final step size of 0.16 degrees 2Theta. This reduces the number of data points by a factor of eight. In the brute-force ALS methods, each data point becomes a variable for each phase. So for a mixture with 5 phase components using powder patterns with 2000 data points, the pure-curve least-squares component will have 10,000 variables to fit. Binning the data reduces this to 1,250 variables, which is still a large number for any normal fitting program to handle. The semi-quant chemometric method discussion will use two sets of mixture data to illustrate the capabilities and limitations. The first data set is a binary system consisting of 11 data files of different composition, including powder patterns of the two pure phases; see Fig. 1.
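The binning step can be sketched as follows. This is an illustrative NumPy implementation, not code from any particular software package:

```python
import numpy as np

def bin_pattern(intensities, factor=2, passes=3):
    """Rebin a 1-D powder pattern by summing adjacent points.

    Each pass merges `factor` neighboring points, so three factor-of-two
    passes turn a 0.02-degree step into a 0.16-degree step and cut the
    point count by a factor of eight.
    """
    y = np.asarray(intensities, dtype=float)
    for _ in range(passes):
        extra = len(y) % factor
        if extra:                 # drop trailing points that don't fill a bin
            y = y[:len(y) - extra]
        y = y.reshape(-1, factor).sum(axis=1)
    return y

# A 2000-point pattern becomes 250 points after three factor-of-two passes.
pattern = np.random.rand(2000)
binned = bin_pattern(pattern)
```

Summing (rather than averaging) within each bin preserves the integrated intensity, which matters once equal-area normalization is applied later.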

Figure 1: Plots of the two pure phase powder patterns for the binary system. Between 23.0 and 25.0 degrees 2Theta, both phases have a strong characteristic peak: the first phase (blue) has its characteristic peak at ~23.5 degrees while the second phase (red) has its characteristic peak at ~24.5 degrees 2Theta. Fortunately, a method similar to PCA can be used on the raw data ensemble to extract potential pure curves and provide a good starting point for the ALS method. This method has been previously published as the Pure Curve Resolution method. All data files are initially normalized to a common area to give scaled semi-quant numbers (the use of different normalization approaches, and the implications for quantitative analysis, is discussed in a later section). The mean and variance are calculated point by point across all the data files in the ensemble; see Fig. 2.

Figure 2: Mean (blue) and variance (red) for the data ensemble (derived using the unit vector normalization). Both the mean and variance traces should carry the character of all phases in the sample. Note that both the characteristic peak of the first phase (~23.5 degrees) and that of the second phase (~24.5 degrees) are visible in both traces. The key part of the pure curve resolution is the identification of peaks in the variance plot that belong entirely to one phase and are not overlapped; the power of the method is limited by the ability to identify such peaks. Often the simplest approach is to track the relative change in each peak intensity across the data ensemble. Peaks belonging to the same phase will follow the same relative change curve, while peaks from different phases will show variations that are orthogonal to each other; see Fig. 3.
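The normalization and the point-by-point mean/variance calculation can be sketched as below, using a hypothetical binary ensemble (the peak positions, widths, and compositions are invented for illustration, and equal-area normalization is used in place of the unit-vector scaling quoted for Fig. 2):

```python
import numpy as np

# Hypothetical binary ensemble: 11 compositions of two Gaussian "phases".
two_theta = np.linspace(20.0, 28.0, 400)
dx = two_theta[1] - two_theta[0]

def peak(center, width=0.1):
    return np.exp(-0.5 * ((two_theta - center) / width) ** 2)

pure_a = peak(23.5)                      # first phase, peak at ~23.5 deg
pure_b = peak(24.5)                      # second phase, peak at ~24.5 deg
fractions = np.linspace(0.0, 1.0, 11)
data = np.array([f * pure_a + (1.0 - f) * pure_b for f in fractions])

# Equal-area normalization: every pattern integrates to the same response.
areas = data.sum(axis=1) * dx
data_norm = data / areas[:, None]

# Point-by-point mean and variance across the ensemble (cf. Fig. 2):
# both characteristic peaks show up in both traces.
mean_trace = data_norm.mean(axis=0)
var_trace = data_norm.var(axis=0)
```

Channels where only one phase scatters carry that phase's share of the variance, which is what the peak-identification step below exploits.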

Figure 3: Trace of the relative variance change for 4 peak intensities across the ensemble of samples. There are 2 distinct non-correlated variations in the relative variance: two of the peaks follow one trend while the other two follow the second trend. The trends are orthogonal, as expected for a binary system. One set of peaks belongs to the first phase and the other set belongs to the second phase; using these peaks as anchors for the correlation calculation allows the extraction of both pure curves. Using the peaks identified as belonging entirely to one phase or the other with no overlap, the correlation function can be used to split the variance into two orthogonal components. Each component corresponds to the variance introduced by one of the phases in the mixture. By combining the mean response with the individual variance components, the pure curves for each phase can be calculated and extracted; see Figs. 4a and 4b.

Figure 4a: The pure curve 1 trace plotted with the measured reference pattern for Form B (the second phase). Although both curves are plotted together, the data are so similar that only a single curve is visible. The pure curve resolution is able to extract the pure reference patterns, down to the smallest detail, from the data ensemble itself! There is no need to make individual pure phase standards.

Figure 4b: The same result as above, but now for the second pure curve and the measured powder pattern for Form A (the first phase). As with the previous plot, the two curves are indistinguishable, which illustrates the capabilities of the pure curve resolution method. The mean response corresponds to the parts of the powder patterns that are common to all the pure curves, while the correlated variance components are the features unique to each individual pure curve. To isolate the pure curve for the first phase, the correlated variance associated with that phase is combined with the mean response. From the sum of these two components, the correlated variance from the other phases is subtracted, thus removing all unique features of those phases from the residual. The primary unknown in this simple calculation is the scale factor to be used for each phase during the addition and subtraction. To determine the optimum value, all the scale factors are initially set to 1.0 and then increased to the maximum value that does not produce negative intensity in the residual pure curve. For the binary example presented so far, the scale factor for pure curve 1 is unchanged at 1.0, while for pure curve 2 the scale factor was increased to 1.6. The pure curves, being real powder patterns, must not have negative data points, and this provides a useful measure of the correct scale factor. To improve the sensitivity of the cut-off for the scale factor, the backgrounds of each data file should be removed before analysis. As equal-area scaling was used (which assumes that all phases have the same scattering efficiency), each pure curve extracted will have the same normalized area. With the initial pure curves now extracted from the data ensemble, full pattern fitting can be used to provide an estimate of the phase composition for each sample.
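The non-negativity cut-off for the scale factor amounts to a simple minimization over channels: the largest allowed subtraction scale is the smallest ratio of the current curve to the competing variance component. A minimal sketch (the function name and example values are hypothetical):

```python
import numpy as np

def max_scale(base, other):
    """Largest s such that (base - s * other) has no negative data points.

    base  : mean response plus this phase's correlated-variance component
    other : correlated-variance component of a competing phase
    """
    base = np.asarray(base, dtype=float)
    other = np.asarray(other, dtype=float)
    mask = other > 1e-12            # only channels the other phase touches
    return float(np.min(base[mask] / other[mask]))

# Toy example (values invented): the cut-off is reached at s = 2.0, where
# the first channel of the residual just touches zero.
base = np.array([1.0, 0.5, 0.2])
other = np.array([0.5, 0.1, 0.0])
s = max_scale(base, other)
residual = base - s * other
```

This is also why background removal matters: a residual background raises `base` everywhere and lets the scale factor overshoot before any channel goes negative.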
These initial pure curves are then fitted to each measured data set to derive scale factors for each phase in each sample. These phase scale factors are directly related to the phase composition of that sample. Because the equal-area normalization was used, the scale factors returned assume that each phase has the same scattering response. This is approximately true for organic materials but may introduce an absolute error of the order of 3%. In Fig. 5, the results of the semi-quant analysis are displayed as calculated weight percent against known weight percent for Form B. There is a systematic shift in the semi-quant result which increases the absolute error to about 3%. The systematic error could be removed through a calibration step, which would leave a random error of about 1% to 2% over the complete composition range used.
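The full pattern fit can be sketched as a non-negative least-squares problem: one scale factor per phase, with the scale factors then renormalized to weight percent. This is an illustrative reconstruction assuming Gaussian pure curves, not the author's actual fitting code:

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical pure curves (columns) and a measured mixture pattern.
two_theta = np.linspace(20.0, 28.0, 400)

def gauss(center, width=0.1):
    return np.exp(-0.5 * ((two_theta - center) / width) ** 2)

S = np.column_stack([gauss(23.5), gauss(24.5)])   # 400 x 2 pure curves
true_fracs = np.array([0.3, 0.7])
measured = S @ true_fracs                          # noiseless synthetic mixture

# Non-negative least squares gives one scale factor per phase; under the
# equal-area normalization these are proportional to the phase fractions.
scales, _ = nnls(S, measured)
weight_pct = 100.0 * scales / scales.sum()
```

On real data the recovered scale factors inherit the ~3% absolute error discussed above, since the equal-scattering-efficiency assumption is only approximate.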

Figure 5: Semi-quant results for the binary mixture data. The red squares represent the known amount of Form B present in the sample and the blue dots represent the calculated (predicted) amount of Form B based upon the chemometric analysis. There is a systematic shift in the data that resembles matrix effects. This plot represents the absolute error in the method; the mean absolute error per point is about 3%, which is typical of semi-quant methods. The systematic shift could be removed using a calibration step, which would reduce the absolute error in the semi-quant numbers to between 1% and 2%.

Ternary Mixture Systems

The real test of the chemometric semi-quant (standardless) method is to use it on more complex mixture samples having more than just two phase components. In the following discussion, a ternary mixture was analyzed. The data ensemble consisted of 21 powder samples with phase compositions running from 10% to 80% for each phase component. As the mixtures become more complex with more components, it becomes harder to isolate unique peaks that have contributions from only a single phase with no overlap. For the ternary system, 3 such peaks could be identified that exhibited the orthogonal correlation property for the variance; see Fig. 6.

Figure 6a: PCA hierarchical clustering (dendrogram with inertia-gain cut) identifies 3 main clusters within the mixture data (samples V1 through V21).

Figure 6b: Factor map (Dim 1, 48%; Dim 2, 39.47%) of the clustering based upon the PCA dendrogram, which cleanly splits the measured data files into 3 groups, each dominated by the pure reference pattern for one phase (V19, V20 and V21 are the reference patterns for Form A, Form B and Form C, respectively).

Figure 6: Normalized intensity variation for 3 unique peaks across the data ensemble. The traces show orthogonal behavior, indicating that each peak may be unique to a single phase component. Following the same procedure as used for the binary system, the individual pure curves can be extracted once the unique peaks have been identified. The 3 pure curves extracted from the data ensemble match the measured reference patterns for each phase down to the smallest level of detail; see Figs. 6a, 6b and 6c. So, provided unique free-standing peaks can be identified for each phase, the chemometric semi-quant method can be used to provide quantitative information on the phase composition.

Figure 6a: Pure Curve 1 plotted with the measured Form C data. Again the agreement is so good that only a single curve can be observed.

Figure 6b: Pure Curve 2 plotted with the measured Form B data. The agreement between the two curves is excellent.

Figure 6c: Plot of Pure Curve 3 with the measured Form A data. The pure curve method appears to be able to isolate effective reference patterns even for complex mixtures, provided at least one unique non-overlapping peak can be found per phase. The 3 pure curves can then be fit to each of the measured powder patterns using a single scale factor for each phase. Within the equal-area normalization, the scale factor is directly proportional to the individual phase concentration in the sample. Figs. 7a, 7b and 7c show the actual phase concentration against the phase concentration predicted by the chemometric method.

Figure 7a: Actual versus predicted phase concentration for Form A. The chemometric semi-quant method agrees very well with the known concentrations; the combined systematic and random errors are significantly less than 1%.

Figure 7b: Actual versus predicted phase concentration for Form B. The chemometric semi-quant method agrees very well with the known concentrations; the combined systematic and random errors are significantly less than 1%.

Figure 7c: Actual versus predicted phase concentration for Form C. The chemometric semi-quant method again agrees very well with the known concentrations, with combined systematic and random errors significantly less than 1%. The chemometric semi-quant method returns better results (in terms of accuracy, precision and robustness) for the ternary mixture system than for the simpler binary mixture system: the ternary phase composition was predicted to significantly better than 1%. This is a remarkable result considering no standards were used to calibrate the system. Although the predictive results returned in the ternary example were good enough on the first pass, these first-pass results can be fed into the brute force and ignorance (BFI) ALS method. The ALS approach can also be used when the pure curve resolution is unable to resolve the reference spectra due to severe peak overlap.

The ALS implementation in R is very specific about the data format used for input. The analytical data should be presented as a matrix of measured responses only, with the x-values (2Theta) handled outside of the refinement. Each measured spectrum is a row in the data matrix. So for the ternary example above, with 21 powder patterns each of 360 data points (binned from 2876 original 2Theta values), the input data matrix Data will be a 21 x 360 numeric matrix. The 2Theta values are stored in a separate vector X2 of length 360. A second vector X is used to store the sequential run numbers (or the times at which the measurements were made); this has length 21 for the ternary example. The pure curve reference spectra should also be entered as a stand-alone matrix; this time the spectra are stored as columns rather than the row format used for the measured data. If the pure curves have not been previously extracted, a random matrix could be used, or the reference spectra could be started as the mean response. Different starting points for the pure curves should be tried if they are not well established. In the ternary example, the pure curve reference matrix S is a 360 x 3 matrix containing the 3 pure curves. The final data input is the initial composition profile for each pure curve in each sample. For 21 samples each with 3 phase components, the input composition profile Cstart is a 21 x 3 numeric matrix. (The matrix notation 21 x 3 indicates 21 rows and 3 columns.) The remaining algorithm inputs are control variables for the method. The least-squares threshold value that stops the refinement is typically set to 0.001; alternatively, the stop point can be a maximum number of iterations. Being a bimodal refinement algorithm, you can select whether to refine the pure curves or the concentration profile as the first step.
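The alternating refinement itself can be sketched in a few lines. The following is a minimal Python stand-in for the R routine described here (not the R package itself): it alternates non-negative fits of the concentration profile and the pure curves, applies the closure constraint, and is seeded with slightly perturbed pure curves standing in for a pure-curve-resolution first pass. The shapes mirror the ternary example (21 x 360 data, 360 x 3 pure curves, 21 x 3 concentrations); the toy peak positions and compositions are invented:

```python
import numpy as np
from scipy.optimize import nnls

def als(data, S0, n_iter=100):
    """Minimal alternating-least-squares sketch.

    data : (n_samples x n_channels) matrix, one measured pattern per row
    S0   : (n_channels x n_phases) starting pure curves
    Each pass solves the concentration profile C for fixed pure curves S,
    applies the closure constraint (rows of C sum to 1), then solves S for
    fixed C; both steps are non-negative least squares.
    """
    S = S0.copy()
    C = np.zeros((data.shape[0], S.shape[1]))
    for _ in range(n_iter):
        for i in range(data.shape[0]):         # concentration step
            C[i], _ = nnls(S, data[i])
        C /= C.sum(axis=1, keepdims=True)      # closure constraint
        for j in range(data.shape[1]):         # pure-curve step
            S[j], _ = nnls(C, data[:, j])
    return C, S

# Toy ternary ensemble mirroring the shapes in the text: 21 samples x 360
# channels, 3 phases with well-separated (invented) peaks.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 360)
S_true = np.column_stack(
    [np.exp(-0.5 * ((x - c) / 0.03) ** 2) for c in (0.25, 0.50, 0.75)])
C_true = rng.dirichlet([1.0, 1.0, 1.0], size=21)   # known compositions
data = C_true @ S_true.T

# Seed the refinement with perturbed pure curves, standing in for a
# pure-curve-resolution first pass, and let ALS refine both factors.
S0 = S_true * rng.uniform(0.9, 1.1, size=S_true.shape)
C_fit, S_fit = als(data, S0)
```

A fixed iteration count stands in for the 0.001 least-squares threshold mentioned above; a production implementation would stop on whichever criterion is reached first.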
As with most ALS routines, there are non-negativity constraints that can be applied to the reference spectra and the concentration profile: you do not expect to see negative phase concentrations, nor do you expect to see negative peaks in the analytical data. Least-squares methods usually work best when the fitted variables are kept within similar ranges of values. For example, the concentration profiles run from 0.0 to 1.0 for each phase. To balance the refinement, it is common practice to set the maximum of each data set to 1.0 and normalize all other data points in the spectrum to that maximum. This can present a problem for accurate quantitative analysis, where it is better to follow a normalization procedure suggested by the analytical technique used. For XRPD on organic materials, the equal-area normalization is a good first pass, but normalizing to sample mass is the best procedure for XRPD. The equal-area or mass normalization can then be scaled to keep all measured data points under a maximum value of 1.0. The non-negativity constraints lose their ability to constrain the data to realistic results if there is a significant background component present in the measured spectra, so background subtraction must be performed on the measured spectra before jumping into the ALS technique. (As each data point is a variable, the measured data files should also be binned to the smallest number of data points possible without losing information content. In the ternary example, each measured spectrum of 2876 data points was binned down to 360 data points, which is a more reasonable number for ALS.)

The final constraint is for the concentration profiles and is called the closure constraint. This allows you to define the total concentration of each sample. In the ternary example, the 3 identified reference patterns account for 100% of the phase composition, so the closure value should be set to 1.0 for each sample (i.e., we know that only the 3 phases being modeled are present in the sample; if unknown phases are present, the closure constraint is not applied). To test the utility of the ALS black-box method, the ternary example data was run through the method after setting all 3 of the reference spectra to straight lines, with the magnitude of each line set to 1 for all phases. The results of the pure curve resolution using ALS are shown in Figs. 8a, 8b and 8c, plotted with the reference patterns derived using the pure curve resolution method. The two sets of reference patterns essentially overlap: a few peak intensities show minor differences, but on average the agreement is remarkably good. So the black-box brute-force ALS approach is able to recover reference spectra from mixture data.

Figure 8a: Pure curve reference spectra derived for Form A using both the pure curve method and the ALS approach. There are some minor peak-intensity differences, but the overall agreement is excellent.

Figure 8b: Pure curve reference spectra derived for Form B using both the pure curve method and the ALS approach. There are some minor peak-intensity differences, but the overall agreement is excellent.

Figure 8c: Pure curve reference spectra derived for Form C using both the pure curve method and the ALS approach. Again, there are some minor peak-intensity differences within an excellent overall agreement.

The ALS black box will also refine the composition profiles for each phase in each sample. The predicted composition results from the ALS black box are very close to the values predicted using the pure curve method; once again, the composition of each phase was predicted to better than 1%. Fig. 9 below shows the predicted versus known composition for Form A as derived by the black-box ALS method.

Figure 9: Predicted versus known composition for Form A as derived from the black-box ALS approach. The errors in the predicted values are a little larger for the ALS method, but the accuracy is still better than 1%.

Normalization Schemes and their Impact on Semi-Quant Numbers

The chemometric methods will always require some form of data normalization; the normalization step is an integral part of many of the methods. When performing phase identification and clustering, normalizing to a unit vector can be the best option. This normalization scheme, however, must not be used for semi-quant analysis. For XRPD, the two most common normalization methods are equal area and maximum height. The equal-area normalization is useful for semi-quant analysis and involves adjusting the measured intensities such that the integrated response for each sample over the measurement range is the same. This is effectively an implementation of Vainshtein's law of equal scattering efficiency for organics: the equal-area normalization assumes all phases have the same total scattering efficiency over the measurement range.
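The normalization schemes discussed here (unit vector, maximum height, equal area) differ only in the divisor applied to each pattern. A small sketch with an invented two-peak pattern:

```python
import numpy as np

# Invented pattern: two Gaussian peaks on a 0.02-degree grid.
two_theta = np.linspace(5.0, 40.0, 1751)
dx = two_theta[1] - two_theta[0]
pattern = (np.exp(-0.5 * ((two_theta - 23.5) / 0.1) ** 2)
           + 0.5 * np.exp(-0.5 * ((two_theta - 24.5) / 0.1) ** 2))

# Unit-vector normalization: suitable for identification and clustering,
# but not for semi-quant analysis.
unit_vec = pattern / np.linalg.norm(pattern)

# Maximum-height normalization: keeps every data point under 1.0,
# convenient for balancing a least-squares refinement.
max_height = pattern / pattern.max()

# Equal-area normalization: every pattern integrates to the same response;
# this is the scheme behind the semi-quant numbers in this paper.
equal_area = pattern / (pattern.sum() * dx)
```

Only the equal-area (or sample-mass) divisor preserves the proportionality between fitted scale factors and phase fractions that the semi-quant step relies on.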

When the equal-scattering-efficiency rule no longer realistically applies, normalization to sample mass is the best choice. To apply this normalization, the sample weight must be measured prior to analysis. This normalization works because the sample density typically follows the electron density, and the electron density of a material is the driver for X-ray scattering.

For more information or to discuss this White Paper, please contact:

Simon Bates, Ph.D.
Research Fellow
Triclinic Labs, Inc.
1-765-588-5632
sbates@tricliniclabs.com