Exploring Raman spectral data using chemometrics

Ludovic Duponchel, Laboratoire de Spectrochimie Infrarouge et Raman, UMR CNRS 8516, Université Lille I, Villeneuve d'Ascq, France. ludovic.duponchel@univ-lille1.fr http://lasir.univ-lille1.fr

We need tools! Modern spectroscopic instrumentation generates ever larger amounts of data, fairly automatically and in a very short period of time. The complexity of the data keeps increasing, but what we perceive as a brake is in fact a real opportunity to solve the analytical problem at hand. Hence the development of chemometric tools and methods to make sense of such data sets and extract the maximum amount of useful information.

What is chemometrics? "Chemometrics is the chemical discipline that uses mathematics, statistics and formal logic (a) to design or select optimal experimental procedures; (b) to provide the maximum relevant chemical information by analysing chemical data; and (c) to obtain knowledge about chemical systems." Professor D.L. Massart

What can we do with chemometrics in spectroscopy? Quantitative analysis and calibration (undoubtedly the oldest task): relate instrumental measurements to the compound of interest in complex mixtures. Multivariate calibration: analysis of several measurements (Raman scattering at different wavelengths) from several samples. Quantitative model development is a two-step procedure: 1) calibration (the model development step), where the main idea is to find a function F linking concentration and spectra, Concentration of interest = F(spectrum); 2) prediction from the calibration model. The most widely used multivariate regression method is the well-known PLS (partial least squares) regression. Only the concentration of the compound of interest has to be known.
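
The two-step procedure above can be sketched on synthetic data. This is a minimal PLS1 (single response) illustration in NumPy, not the implementation used in the studies presented here: the Gaussian "spectra", the noise level, the number of components and the helper names `pls1_fit` and `pls1_predict` are all assumptions made for the example. Note that only the concentration of the compound of interest is given to the calibration, even though two interfering compounds are present.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit a PLS1 model (one response) with NIPALS-style deflation."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr                      # weight: direction of maximal covariance with y
        w = w / np.linalg.norm(w)
        t = Xr @ w                         # scores of the samples on this component
        tt = t @ t
        p = Xr.T @ t / tt                  # X loadings
        q_a = yr @ t / tt                  # y loading
        Xr = Xr - np.outer(t, p)           # deflate X and y
        yr = yr - q_a * t
        W.append(w)
        P.append(p)
        q.append(q_a)
    W, P, q = np.column_stack(W), np.column_stack(P), np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)  # regression vector: the function F
    return x_mean, y_mean, coef

def pls1_predict(model, X):
    x_mean, y_mean, coef = model
    return (X - x_mean) @ coef + y_mean

# Synthetic mixtures (assumed): analyte band at channel 70, two interferents.
rng = np.random.default_rng(0)
axis = np.arange(200)
pure = np.stack([np.exp(-0.5 * ((axis - c) / 8.0) ** 2) for c in (70, 90, 140)])

def simulate(n_samples):
    conc = rng.uniform(0.0, 1.0, size=(n_samples, 3))
    spectra = conc @ pure + 0.01 * rng.standard_normal((n_samples, axis.size))
    return spectra, conc[:, 0]             # only the analyte concentration is kept

X_cal, y_cal = simulate(40)                # step 1: calibration set
X_val, y_val = simulate(20)                # independent set for step 2
model = pls1_fit(X_cal, y_cal, n_components=3)
y_pred = pls1_predict(model, X_val)
rmsep = float(np.sqrt(np.mean((y_pred - y_val) ** 2)))
print(f"RMSEP on the independent set: {rmsep:.4f}")
```

In real applications the number of components is usually chosen by validation rather than fixed in advance as it is here.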

Chemometrics task #1: Quantitative analysis / Multivariate calibration. Calibration step: selection of m representative samples, spectral analysis and concentrations from reference analysis, spectral correction, then development of a calibration equation: Concentration of interest = F(spectrum). Validation step: selection of p representative samples, spectral analysis and concentrations from reference analysis, spectral correction, predicted concentrations, then accuracy estimation (RMSEP). Model OK? If so, prediction on real unknown samples.
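
The accuracy estimate of the validation step, the RMSEP (root mean square error of prediction), compares the p predicted concentrations with the p reference values: RMSEP = sqrt( (1/p) Σ (predicted_i − reference_i)² ). A minimal sketch follows; the function name and the four concentration values are illustrative assumptions, not data from the slides.

```python
import math

def rmsep(reference, predicted):
    """Root mean square error of prediction over the validation samples."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical validation set (arbitrary concentration units).
reference_values = [1.0, 2.0, 3.0, 4.0]
predicted_values = [1.1, 1.9, 3.2, 3.8]
error = rmsep(reference_values, predicted_values)
print(f"RMSEP = {error:.3f}")  # about 0.158, in the same units as the concentrations
```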

A quantitative analysis example in biotechnology. A representative case: CHO (Chinese hamster ovary) cells cultivated to produce monoclonal antibodies. The pharmaceutical industry currently focuses on the QbD (Quality by Design) initiative: identify, monitor, and ultimately control all the critical process parameters (CPPs) that affect the quality of a product. Common CPPs for cell culture: physical parameters (temperature, pH); chemical properties, i.e. substrate and by-product concentrations (glucose, lactate, glutamine); biochemical properties (cell number, viability). Monitoring them is a time-consuming task with conventional methods. Goal: in-line prediction of all these CPPs from Raman spectra alone, with possible bioprocess feedback in time.

A quantitative analysis example in biotechnology. Example from the paper. Spectral analysis / instrumental setup: stainless steel immersion probe with sapphire window fitted into the bioreactor. Excitation at 785 nm / 350 mW of power output. Spectral range: 350-1750 cm⁻¹. Total acquisition time per spectrum: 300 s. Spectral corrections: cosmic ray removal; derivative (removes the fluorescence effect). One calibration model per parameter (glucose, glutamine, glutamate, ammonia, lactate and TCD models), giving the predicted concentrations of glucose, glutamine, glutamate, ammonia and lactate, and the predicted total cell density.
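
Why a derivative removes a fluorescence-like background can be checked numerically: a slowly varying baseline contributes only a small, nearly constant term to the first derivative, while narrow Raman bands keep their structure. The sketch below uses a plain finite-difference derivative on a synthetic spectrum; the slides do not say which derivative algorithm was used, and the band position, band width and linear baseline are assumptions.

```python
import numpy as np

shift = np.linspace(350.0, 1750.0, 701)                 # Raman shift axis, cm^-1
peak = np.exp(-0.5 * ((shift - 1000.0) / 10.0) ** 2)    # one synthetic Raman band
slope = 0.002
baseline = 5.0 + slope * shift                          # assumed linear fluorescence-like background
raw = peak + baseline

d_raw = np.gradient(raw, shift)                         # first derivative of the measured spectrum
d_peak = np.gradient(peak, shift)                       # first derivative of the band alone

# The large offset (5.0) vanishes; the sloped background leaves only the constant `slope`.
residual = float(np.max(np.abs(d_raw - d_peak - slope)))
print(f"largest deviation from a constant offset: {residual:.2e}")
```

On noisy experimental spectra a smoothing derivative filter is generally preferred, since a plain finite difference amplifies noise.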

A quantitative analysis example in biotechnology. Model accuracy is somewhat lower than that of the reference methods, but the analysis is in-line, with a corresponding report every 5 minutes! A possible feedback for bioprocess optimization. [Figure legend: O = predicted values from spectra; reference values.]

Chemometrics task #2: exploratory data analysis. Given a spectral dataset, look at the relationships between samples and variables. Are some samples similar or dissimilar? Find structures, find outliers. A new vision for spectroscopy: a spectrum can be considered as a point in a multidimensional space. This is not so easy for a chemist, hence the introduction of the well-known Principal Component Analysis (PCA), which is the basis of many chemometric methods.

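The well-known Principal Component Analysis (PCA). A very simple dataset: 12 Raman spectra of 12 samples (three wavelengths per spectrum). [Figure: the Raman spectrum of sample X, with intensities AU1, AU2 and AU3 at wavelengths λ1, λ2 and λ3, is plotted as a single point with coordinates (AU1, AU2, AU3) in the space of axes λ1, λ2, λ3; the 12 samples form a cloud of points.]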

The well-known Principal Component Analysis (PCA). A very simple dataset: 12 Raman spectra of 12 samples (three wavelengths per spectrum). PCA finds the direction of maximal variance in the spectral dataset. PC 1, the first principal component, expresses the maximal variance. PC 2, the second principal component, expresses the residual variance and is orthogonal to the previous PC. Scores (Score1, Score2): the new representation (coordinates) of the spectra in the new data space. [Figure: cloud of sample points in the (λ1, λ2, λ3) space with the PC 1 and PC 2 axes, and the resulting scores plot.]

The well-known Principal Component Analysis (PCA). Why is PCA so interesting for Raman dataset exploration? Example: given Raman spectra of 500 wavelengths, fewer than 10 PCs are often enough to capture 90% of the total variance of the initial Raman spectral data (because of the many correlations between wavelengths, i.e. redundant information). PCA advantages: most of the total variance is kept with a lower number of dimensions. A spectrum is represented by a small number of scores (its new coordinates). 2D or 3D representations (scores plots) become possible while keeping the chemical information present in the initial Raman dataset. A distance between points (samples) can be used to estimate their similarities or dissimilarities.
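
The variance argument above can be reproduced with a few lines of NumPy. The sketch computes PCA through the singular value decomposition of the mean-centered data; the 500-channel spectra are synthetic mixtures of three Gaussian bands with a little noise, which is an assumption chosen only to mimic strongly correlated wavelengths.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_channels = 60, 500
axis = np.arange(n_channels)

# Synthetic dataset (assumed): mixtures of three bands plus measurement noise.
pure = np.stack([np.exp(-0.5 * ((axis - c) / 15.0) ** 2) for c in (120, 250, 380)])
conc = rng.uniform(0.0, 1.0, size=(n_samples, 3))
spectra = conc @ pure + 0.01 * rng.standard_normal((n_samples, n_channels))

# PCA by SVD of the mean-centered spectra.
centered = spectra - spectra.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)                 # fraction of total variance per PC

# Smallest number of PCs whose cumulative explained variance reaches 90 %.
n_pcs = int(np.searchsorted(np.cumsum(explained), 0.90) + 1)
scores = U[:, :n_pcs] * s[:n_pcs]               # new coordinates of each spectrum
loadings = Vt[:n_pcs]                           # directions of the PCs in wavelength space

print(f"{n_pcs} PCs capture {explained[:n_pcs].sum():.1%} of the variance")

# Similarity between two samples as a Euclidean distance in the scores space.
distance_0_1 = float(np.linalg.norm(scores[0] - scores[1]))
```

Each 500-point spectrum is thus reduced to `n_pcs` scores, and the first two or three columns of `scores` are what a scores plot displays.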

Exploratory Data Analysis with PCA. Example from the paper: study focused on a particular pharmaceutical product manufactured by a company and sold on the market in capsules (solid form). Question: is it possible to observe spectral differences between original products and counterfeits with Raman spectroscopy?

Exploratory Data Analysis with PCA. The dataset: twenty-seven seizures of counterfeits, six batches of genuine product. Total number of samples: 258 capsules. Instrumental setup: Raman spectroscopy with optical probe (6 mm diameter laser spot on the analyzed capsule). Excitation at 785 nm / 260 mW of power output. Spectral range: 200-1890 cm⁻¹. Total acquisition time per spectrum: 30 s. Raw Raman spectra of original products and counterfeits: difficult to observe spectral differences. PCA on 1st derivative spectra: possible to discriminate original products from counterfeits.

Chemometrics task #3: sample classification. Often, datasets consist of samples that belong to several different groups or classes. Different classes may be the result of: samples prepared with different raw materials (possibly from different vendors), class of chemical compound (aromatic, aliphatic, carbonyl, etc.), process state (startup, normal, particular faults, etc.). Many methods have been developed for classifying samples based on their measured responses.

Chemometrics task #3: sample classification. Cluster analysis or unsupervised pattern recognition: methods that attempt to identify groups or classes without using pre-established class memberships. Only spectra are used for the model development. The best known: PCA, hierarchical clustering and dendrograms. Classification or supervised pattern recognition: methods that use known class memberships. Spectra and class memberships (A, B, C) are used for the model development. The best known: k-NN (k-Nearest Neighbors), LDA (Linear Discriminant Analysis), SIMCA (Soft Independent Modeling of Class Analogy).

Classification or supervised pattern recognition. The simplest classification method: k-Nearest Neighbors, illustrated with a very simple spectral dataset (Raman spectra with 2 wavelengths, λ1 and λ2; 8 samples in two classes A and B). Objective: predict the class of an unknown sample X from these known cases. Procedure: 1. Calculate all distances (e.g. Euclidean) between the unknown sample X and all samples in the dataset. 2. Find the k smallest distances: d1, d2, d3. 3. Use a vote to predict the class. Example: for k = 3, the smallest distances are d1, d2, d3, with d1: class A, d2: class A, d3: class B, so the unknown sample X is assigned to class A.
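
The three-step procedure above fits in a few lines of standard-library Python. The coordinates of the 8 known samples and of the unknown sample X are invented for the illustration (the slide only shows them graphically); they are chosen so that, as in the slide, the three nearest neighbours are two class A samples and one class B sample.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, x, k=3):
    """Predict the class of x by majority vote among its k nearest neighbours."""
    # Step 1: Euclidean distance from x to every known sample.
    distances = sorted(
        (math.dist(point, x), label)
        for point, label in zip(train_points, train_labels)
    )
    # Step 2: keep the k smallest distances. Step 3: vote.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical intensities at (λ1, λ2) for 8 samples in two classes.
points = [(1.0, 1.2), (1.5, 0.8), (1.2, 1.9), (2.0, 1.5),   # class A
          (3.0, 3.2), (3.5, 2.8), (2.6, 2.4), (3.8, 3.6)]   # class B
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
unknown_x = (2.1, 2.0)

predicted_class = knn_predict(points, labels, unknown_x, k=3)
print(predicted_class)  # A: the three nearest neighbours are A, B, A
```

An odd k avoids tied votes in a two-class problem; with real spectra each point has as many coordinates as there are wavelengths (or scores, if PCA is applied first).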

Classification or supervised pattern recognition. Example from the paper: in clinical environments, it is necessary to identify pathogenic germs and microorganisms as soon as possible to gain potentially life-saving time. Conventional microbiological methods are time-consuming (cultivation step). Proposed solution: direct analysis of single cells with micro-Raman spectroscopy and the development of a classification method to predict strains of bacteria.

Classification or supervised pattern recognition. The dataset: 29 different bacterial strains. Total number of spectra: 3642. Spectral analysis: confocal micro-Raman spectrometer. Spectral range: 537 cm⁻¹ to 3654 cm⁻¹. Laser 532 nm / spot diameter 0.7 µm / integration 60 s. Pre-processing options compared: raw data, spike elimination, vector normalization, Whittaker smoother. [Figures: Raman spectra of the 29 strains (different pre-processing); classification results with an independent dataset.] Observations: 87.3%, a rather good classification rate considering the 29 strains and the single-cell analysis. Even k-NN can be a good classifier (there is no rule for choosing the algorithm; it depends on the data structure). Pre-processing is important.

Multivariate Curve Resolution, a more advanced tool. My point of view: multivariate curve resolution is one of the most important developments in chemometric methods of the last 20 years. Objective: recovery of the response profiles of pure components in an unresolved and unknown mixture obtained from evolutionary processes. Roma Tauler, father of the MCR-ALS algorithm. In spectroscopy: from a set of mixture spectra, simultaneous extraction of the spectral signatures of the pure compounds present in the chemical system and of their contributions in each mixture, with no a priori knowledge. Is there a potential for MCR in spectroscopic imaging?

Multivariate Curve Resolution in imaging spectroscopy. The classical method for generating concentration maps: 1) Acquisition of the spectral data cube (e.g. mapping). 2) Selection of a wavelength specific to the compound of interest. 3) Signal integration to generate the corresponding chemical map (i.e. a slice of the data cube). We often forget that wavelength specificity is a strong hypothesis! [Figure: microscope and spectrometer acquiring the experimental data cube (x and y pixels, λ) from the sample, with example absorbance spectra.]
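
In array terms, the classical method amounts to integrating the data cube over a chosen spectral window. The sketch below uses a random placeholder cube and an arbitrary window of channels; both are assumptions, and in practice the window is the band believed to be specific to the compound of interest.

```python
import numpy as np

rng = np.random.default_rng(0)
ny, nx, n_channels = 6, 12, 300
cube = rng.random((ny, nx, n_channels))        # placeholder data cube with axes (y, x, λ)

band = slice(100, 120)                         # assumed channels specific to the compound
chemical_map = cube[:, :, band].sum(axis=2)    # signal integration: one value per pixel

print(chemical_map.shape)                      # (6, 12): one image for this compound
```

Any other compound contributing inside the window is integrated as well, which is exactly the specificity hypothesis mentioned above.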

Multivariate Curve Resolution in imaging spectroscopy. Wavelength specificity is a strong hypothesis. Some drawbacks: it is usually necessary to know all compounds present in the analyzed sample. If a compound is forgotten, interferences are possible, leading to overestimated concentrations, false distribution maps and a biased vision of analytical reality. It is impossible to select a specific spectral range for an unexpected compound (no image). It is sometimes impossible to find a specific spectral range for each compound, because of the complexity of the sample (high number of species) and/or the large bandwidths of the spectroscopy considered. What can be done with Multivariate Curve Resolution?

Multivariate Curve Resolution in imaging spectroscopy. MCR-ALS methodology: 1) Unfolding of the experimental data cube (x × y × λ) into the spectral data matrix D, with x·y rows (pixels) and λ columns. 2) Rank evaluation of the D matrix (nc: number of spectral contributions in the dataset). 3) Simultaneous extraction of C and $S^T$ by multivariate curve resolution (bilinear decomposition), with no prior knowledge: $D = C\,S^{T}$, where C (x·y × nc) is the pure compounds concentration matrix and $S^{T}$ (nc × λ) is the pure compounds spectral matrix. 4) Refolding of C in order to generate chemical maps. [Figure: experimental data cube, first pure spectrum (molecular characterization) and refolding of the first pure compound concentration distribution in the pixel space (x, y).]
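
The unfold / decompose / refold sequence can be sketched as follows. This is a deliberately bare alternating least squares loop with non-negativity imposed by clipping, which is a simplification: it is not the MCR-ALS implementation used for the results shown in this talk, and the initial-estimate helper `pick_initial_spectra`, the synthetic two-compound cube and the iteration count are assumptions made for the example.

```python
import numpy as np

def pick_initial_spectra(D, n_components):
    """Crude initial estimate (hypothetical helper): pick the most dissimilar pixel spectra."""
    unit = D / np.linalg.norm(D, axis=1, keepdims=True)
    chosen = [int(np.argmin(unit @ unit.mean(axis=0)))]      # farthest from the mean direction
    while len(chosen) < n_components:
        similarity = (unit @ unit[chosen].T).max(axis=1)
        chosen.append(int(np.argmin(similarity)))            # least similar to those already chosen
    return D[chosen].copy()

def mcr_als(D, S_init, n_iter=50):
    """Alternate least-squares updates of C and S in D ≈ C S, clipping negatives to zero."""
    S = S_init.copy()
    for _ in range(n_iter):
        C = np.clip(D @ np.linalg.pinv(S), 0.0, None)        # concentrations given spectra
        S = np.clip(np.linalg.pinv(C) @ D, 0.0, None)        # spectra given concentrations
        S /= np.maximum(np.linalg.norm(S, axis=1, keepdims=True), 1e-12)  # fix the scale
    C = np.clip(D @ np.linalg.pinv(S), 0.0, None)            # final C consistent with normalized S
    return C, S

# Synthetic data cube (assumed): two compounds with opposite gradients along x.
rng = np.random.default_rng(0)
ny, nx, n_channels, nc = 6, 12, 200, 2
axis = np.arange(n_channels)
pure = np.stack([np.exp(-0.5 * ((axis - c) / 10.0) ** 2) for c in (70, 130)])
c1 = np.tile(np.linspace(1.0, 0.0, nx), (ny, 1))
conc = np.stack([c1, 1.0 - c1], axis=2)                      # shape (ny, nx, 2)
cube = conc @ pure + 0.001 * rng.standard_normal((ny, nx, n_channels))

D = cube.reshape(ny * nx, n_channels)                        # 1) unfolding
C, S = mcr_als(D, pick_initial_spectra(D, nc))               # 3) bilinear decomposition
maps = C.reshape(ny, nx, nc)                                 # 4) refolding into chemical maps
lack_of_fit = float(np.linalg.norm(D - C @ S) / np.linalg.norm(D))
print(f"relative lack of fit: {lack_of_fit:.4f}")
```

Each row of `S` is an estimated pure spectrum and each slice `maps[:, :, k]` the corresponding distribution map; the solutions are defined up to the usual scale and ordering ambiguities.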

Multivariate Curve Resolution in imaging spectroscopy. Example from the paper: kidney stone pathology. It is important to determine the composition and the distribution of all chemical compounds present in the sample, in order to understand its history of formation and provide a diagnosis, and to institute appropriate therapy to reduce future recurrences. Solution: use of Raman macro-imaging and the MCR-ALS method to detect and extract, with no a priori knowledge, the pure spectra of all compounds and their corresponding chemical maps.

Multivariate Curve Resolution in imaging spectroscopy. Raman macro-imaging. Laser: 1064 nm / 100 mW. Spectral range: 400-3500 cm⁻¹. 5 s acquisition time per position. Experimental data cube (3780 spectra). SVD: detection of 4 spectral contributions in the dataset. MCR-ALS: simultaneous extraction of pure spectra and chemical maps (scale bar: 1 mm). Molecular identification: uric acid, whewellite, mucin, background (fluorescence). Was there a really specific wavelength for each compound?
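
The SVD step can be illustrated on a synthetic matrix of the same size (3780 spectra). Singular values associated with chemical contributions stand well above those produced by noise, so the number of contributions can be read from the largest gap. The gap criterion, the Gaussian pure spectra and the noise level below are assumptions for the example; the slides do not state which criterion was applied to the kidney stone data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_spectra, n_channels, n_true = 3780, 300, 4
axis = np.arange(n_channels)

# Synthetic unfolded data matrix: 4 contributions plus noise.
pure = np.stack([np.exp(-0.5 * ((axis - c) / 12.0) ** 2) for c in (50, 120, 190, 260)])
conc = rng.uniform(0.0, 1.0, size=(n_spectra, n_true))
D = conc @ pure + 0.01 * rng.standard_normal((n_spectra, n_channels))

singular_values = np.linalg.svd(D, compute_uv=False)

# Ratio between consecutive singular values: the largest drop marks the rank.
ratios = singular_values[:10] / singular_values[1:11]
n_contributions = int(np.argmax(ratios)) + 1
print(np.round(singular_values[:6], 2), "->", n_contributions, "contributions")
```

With real data the gap is rarely this clean, and the retained rank is usually confirmed by inspecting the extracted profiles.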

The super-resolution concept in Raman spectroscopy. Raman imaging: a great potential, but... Current trends: desire to analyze ever smaller samples (nanosciences) or to observe more details on bulk samples. General observations: the diffraction limit sets the maximal lateral resolution, i.e. the minimal distance d between two objects for them to be considered resolved (really observed): $d = 0.61\,\lambda / \mathrm{NA}$, with λ the radiation wavelength and NA the numerical aperture of the optical system. However, it is generally considered that the Raman spatial resolution is ~1 µm, which is rather incompatible with the spectroscopic characterization of micron-sized samples.
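
A quick evaluation of the formula above for the two visible/near-infrared excitation wavelengths quoted earlier in this talk (532 nm and 785 nm). The numerical aperture of 0.9 is an assumed value for a high-magnification dry objective, not a figure given in the slides.

```python
def lateral_resolution_nm(wavelength_nm, numerical_aperture):
    """Diffraction-limited lateral resolution d = 0.61 * lambda / NA, in nm."""
    return 0.61 * wavelength_nm / numerical_aperture

assumed_na = 0.9  # hypothetical objective
for wavelength in (532.0, 785.0):
    d = lateral_resolution_nm(wavelength, assumed_na)
    print(f"lambda = {wavelength:.0f} nm, NA = {assumed_na}: d = {d:.0f} nm")
# lambda = 532 nm, NA = 0.9: d = 361 nm
# lambda = 785 nm, NA = 0.9: d = 532 nm
```

These are theoretical limits; the ~1 µm figure quoted above is the resolution generally obtained in practice.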

The super-resolution concept in Raman spectroscopy. Solutions to push the diffraction limit? Instrumental solution: near-field spectroscopy. Possible but not so simple. Our chemometrics solution: apply the super-resolution concept to images acquired with our classical far-field spectroscopy. The super-resolution concept. Background: an established discipline of the signal processing community (early works in 1980, a really emerging field of research in 2000). Concept: simultaneous exploitation of several low-resolution images of the same object (observed from different angles) in order to obtain one higher-resolution image. Condition of application: sub-pixel shifts between the low-resolution images (image shifts smaller than the pixel size of the low-resolution image). [Figure: low-resolution image #1 and low-resolution image #2, with pixel size = x µm and shift < x µm.]

A super-resolution example in space observation. NASA's Viking Mission: the mission objectives were to obtain high resolution images of the Martian surface, characterize the structure and composition of the atmosphere and surface, and search for evidence of life. Source : Simultaneous use of 24 shifted images of Mars acquired by Viking Orbiter 1. Super-resolution: from 742 m / pixel to 186 m / pixel.

Super-resolution principle. To understand the super-resolution concept, make the link between the ideal high-resolution (HR) image and a low-resolution (LR) image: the image formation / degradation model. The observation model describes the direct LR image acquisition process by an imaging degradation system. Starting from the ideal object (HR image x): 1) Move: translation, rotation. 2) Blur: optical blur (defocusing, diffraction limit, optical aberrations) and point spread function, PSF (the detector is not a point). 3) Undersampling (L1, L2). 4) Addition of noise n_i, giving y_i, the i-th observed LR image.

Super-resolution principle. The observation model with some linear algebra (just 2 slides). Each low-resolution image $y_i$ is a shifted, blurred, under-sampled and noisy version of the high-resolution image $x$. LR image: $y_i = [y_{i,1}, y_{i,2}, \ldots, y_{i,M}]^T$, with $N_1 \times N_2$ pixels and $M = N_1 N_2$. HR image: $x = [x_1, x_2, \ldots, x_N]^T$, with $N_1 L_1 \times N_2 L_2$ pixels and $N = N_1 L_1 N_2 L_2$.

$$y_i = E_i F_i M_i\, x + n_i, \qquad 1 \le i \le p$$

(p low-resolution images are available), where $E_i$ is the under-sampling matrix ($N_1 N_2 \times N_1 L_1 N_2 L_2$), $F_i$ the blur matrix ($N_1 L_1 N_2 L_2 \times N_1 L_1 N_2 L_2$) and $M_i$ the shift matrix ($N_1 L_1 N_2 L_2 \times N_1 L_1 N_2 L_2$).
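
The observation model can be simulated directly with array operations instead of explicit matrices. In this sketch the shift $M_i$ is a circular translation by a whole number of HR pixels, the blur $F_i$ is a box average and the under-sampling $E_i$ keeps one pixel out of L in each direction; these operator choices, the test object and the noise level are all assumptions. The factor L = 5 and the 6 x 12 LR grid are chosen to mirror a 1 µm pixel moved in 200 nm steps.

```python
import numpy as np

def box_blur(image, size):
    """Circular moving average over a size x size neighbourhood (stand-in for F_i)."""
    acc = np.zeros_like(image, dtype=float)
    for dy in range(size):
        for dx in range(size):
            acc += np.roll(image, (-dy, -dx), axis=(0, 1))
    return acc / size**2

def observe(x_hr, shift, factor, noise_sigma, rng):
    """One LR image: y_i = E_i F_i M_i x + n_i."""
    moved = np.roll(x_hr, shift, axis=(0, 1))                   # M_i: shift in HR pixels
    blurred = box_blur(moved, factor)                           # F_i: blur
    sampled = blurred[::factor, ::factor]                       # E_i: under-sampling
    noise = noise_sigma * rng.standard_normal(sampled.shape)    # n_i: additive noise
    return sampled + noise

rng = np.random.default_rng(0)
L = 5                                                           # L1 = L2 = 5 (assumed)
x_hr = np.zeros((30, 60))                                       # HR grid: N1*L1 x N2*L2
x_hr[12:18, 20:40] = 1.0                                        # a simple rectangular object

# 25 LR images, one per sub-pixel shift on a 5 x 5 grid of HR-pixel offsets.
shifts = [(dy, dx) for dy in range(L) for dx in range(L)]
lr_images = [observe(x_hr, s, L, 0.01, rng) for s in shifts]
noise_free = observe(x_hr, (1, 2), L, 0.0, rng)
print(len(lr_images), lr_images[0].shape)                       # 25 (6, 12)
```

Each LR pixel is then the average of an L x L block of the shifted HR image, which is why shifts smaller than one LR pixel bring genuinely new information.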

Super-resolution principle. Generalization to the p available low-resolution images:

$$\begin{bmatrix} y_1 \\ \vdots \\ y_p \end{bmatrix} = \begin{bmatrix} E_1 F_1 M_1 \\ \vdots \\ E_p F_p M_p \end{bmatrix} x + \begin{bmatrix} n_1 \\ \vdots \\ n_p \end{bmatrix} \quad\Longleftrightarrow\quad y = H\,x + n$$

Super-resolution is what we call an "inverse problem" (the reciprocal of modelling). Objective: retrieve x (the high-resolution image) from y (the low-resolution images) and an estimate of the H matrix. Solving this equation is not trivial, but mathematical methods exist (see refs.).
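
A one-dimensional toy version shows how the stacked system $y = Hx + n$ is assembled and inverted. Everything here is an assumption chosen to keep the problem small and well conditioned: 12 HR samples, an under-sampling factor of 3, a mild circular blur, one LR signal per HR-pixel shift, and no noise. Plain least squares is used only because this toy H is invertible; it is not the reconstruction algorithm of the cited references, and real, noisy data call for a regularized solution.

```python
import numpy as np

N, L = 12, 3                       # HR length and under-sampling factor (toy values)
p = L                              # number of LR signals, one per HR-pixel shift
eye = np.eye(N)

E = eye[::L]                       # under-sampling matrix, shape (N // L, N)
F = (0.2 * np.roll(eye, -1, axis=1)
     + 0.6 * eye
     + 0.2 * np.roll(eye, 1, axis=1))                       # circular 3-point blur matrix
shift_matrices = [np.roll(eye, s, axis=0) for s in range(p)]  # M_i: circular shift by s HR pixels

# Stack the p observation operators: H = [E F M_1; ...; E F M_p].
H = np.vstack([E @ F @ M_i for M_i in shift_matrices])

rng = np.random.default_rng(0)
x_true = rng.random(N)             # unknown HR signal
y = H @ x_true                     # the p LR signals, stacked (noise-free)

x_hat, *_ = np.linalg.lstsq(H, y, rcond=None)               # solve the inverse problem
print(np.max(np.abs(x_hat - x_true)))                       # close to machine precision
```

Each LR signal alone has only 4 samples; it is the combination of the 3 shifted acquisitions that makes the 12 unknowns recoverable.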

Super-resolution for Raman spectroscopic imaging. To be published soon. A super-resolution example: Raman imaging of industrial dust particles. Micro-Raman analysis of a region of interest of the visible image with a classical setup (1 µm step, 1 µm² pixels). Raman spectral data cube (6 x 12 x λ), i.e. 72 Raman spectra. [Figure: visible image with the region of interest (scale bar: 10 µm) and the 72 Raman spectra.]

Super-resolution for Raman spectroscopic imaging. Generation of chemical maps by MCR-ALS simultaneous extraction: detection of 5 chemical contributions, identified as Na₂SO₄; PbSO₄; CaSO₄,H₂O / NaNO₃; CaSO₄,H₂O / organic?; CaSO₄,2H₂O. A molecular identification is possible, but the spatial resolution is poor for the characterization of such particles.

Super-resolution for Raman spectroscopic imaging. The multiple acquisitions used for the super-resolution concept: acquisition of 25 spectral data cubes and, by MCR-ALS, generation of 25 low-resolution images for each chemical contribution (sub-pixel moves between measurements: multiples of 200 nm in the x and y directions).

Super-resolution for Raman spectroscopic imaging. MCR-ALS + SR (super-resolution): a possible molecular identification and sub-micron spatial resolution. It is not over-sampling. Measurements with a USAF 1951 resolution target: the LabRam HR intrinsic spatial resolution is ~600 nm, against 200 nm with super-resolution.

Conclusions. Chemometrics is a good way to push the limits of Raman spectroscopy. It provides interesting tools for data exploration, classification or quantification purposes. The Multivariate Curve Resolution method offers high potential for the analysis of complex datasets with no a priori knowledge, and a high degree of adaptability (all evolutionary systems, e.g. time-resolved). Super-resolution is a way to increase the spatial resolution in spectroscopic imaging (even for advanced instrumental setups).

Acknowledgments: Dr Marc Offroy, Myriam Moreau, Dr Sophie Sobanska, Sara Piqueras, Dr Anna De Juan, Dr Peyman Milanfar, Dr Yves Roggo.

Super-resolution for spectroscopic imaging. To go further with super-resolution:
NIR spectroscopic imaging: M. Offroy, Y. Roggo, L. Duponchel, "Increasing the spatial resolution of near infrared chemical images (NIR-CI): The superresolution paradigm applied to pharmaceutical products", Chemometrics and Intelligent Laboratory Systems 117, pp. 183-188 (2012).
MIR spectroscopic imaging: S. Piqueras, L. Duponchel, M. Offroy, F. Jamme, R. Tauler, A. de Juan, "Chemometric strategies to unmix information and increase the spatial description of hyperspectral images: a single cell case study", Analytical Chemistry, accepted (2013). M. Offroy, Y. Roggo, P. Milanfar, L. Duponchel, "Infrared chemical imaging: Spatial resolution evaluation and super-resolution concept", Analytica Chimica Acta 674 (2), pp. 220-226 (2010).
Raman spectroscopic imaging: L. Duponchel, P. Milanfar, C. Ruckebusch, J.-P. Huvenne, "Super-resolution and Raman chemical imaging: From multiple low resolution images to a high resolution image", Analytica Chimica Acta 607 (2), pp. 168-175 (2008).