
Supporting information for:
A New Chemometric Approach for Automatic Identification of Microplastics from Environmental Compartments Based on FT-IR Spectroscopy

Gerrit Renner, Torsten C. Schmidt, and Jürgen Schram

Instrumental Analytical and Environmental Chemistry, Faculty of Chemistry, Niederrhein University of Applied Sciences, Frankenring 20, D-47798 Krefeld, Germany
Instrumental Analytical Chemistry and Centre for Water and Environmental Research (ZWU), University of Duisburg-Essen, Universitätsstr. 5, D-45141 Essen, Germany

E-mail: schram@hs-niederrhein.de
Phone: +49 (0)179 1043284

Estimation of Initial Fit Parameters

Asymmetry Factor a

One essential step of µident is curve fitting, and all fit parameters therefore have to be estimated during an initialisation procedure. We decided to use a vibrational band model that respects peak asymmetry, which was presented by Stancik and Brauns. They replaced the static peak width γ with a dynamic peak width based on a logistic

function and established an asymmetry factor a and the dynamic peak width base value γ0. In the first derivative I' of an asymmetric peak, the absolute values of the extrema max(I') and min(I') differ significantly. This can be used to estimate the asymmetry factor by analysing the two extreme values and calculating their ratio R. For this ratio, we decided to use the modified Canberra distance due to its closed range of [-1, +1], as can be seen in Eq. 1. In addition, Fig. S1 gives an overview of different asymmetry factors and their corresponding peak shapes, first derivatives and ratios R.

$$R = \frac{|\min(I')| - |\max(I')|}{|\min(I')| + |\max(I')|} = f(a) \tag{1}$$

Figure S1: Peak shapes and their corresponding first derivatives with differing asymmetry factors a. It can be observed that the ratio R changes and that an absolute ratio |R| > 0.5 leads to a significant peak shoulder within the first derivative.

However, the ratio R is also influenced by the dynamic peak width γ0, which can be observed in Fig. S2. In this example, the asymmetry factor is kept constant while the dynamic peak

width γ0 varies, and as a result, the ratio R still changes.

$$R = \frac{|\min(I')| - |\max(I')|}{|\min(I')| + |\max(I')|} = f(a, \gamma_0) \tag{2}$$

Figure S2: Peak shapes and their corresponding first derivatives with differing dynamic peak widths γ0 and constant asymmetry factor a. It can be observed that the ratio R changes and that an absolute ratio |R| > 0.5 leads to a significant peak shoulder within the first derivative.

In Fig. S1 as well as in Fig. S2, a significant change in the first derivative is recognisable if the absolute ratio |R| > 0.5. Vibrational bands of this kind cannot be observed in real infrared spectra of microplastics, and therefore we focussed on the estimation of vibrational bands with an absolute ratio |R| ≤ 0.5. For this purpose, 1 × 10^4 vibrational bands with differing dynamic peak widths and asymmetry factors were simulated, and the corresponding local extrema were analysed in order to calculate the ratio R. For a suitable estimation of the initial asymmetry factor, an empirical function was defined that describes the dependence of the ratio R on the asymmetry factor a and the dynamic peak width γ0. We tried many different two-dimensional logistic functions, but a simple error function fits the empirical data best, which can be seen in Fig. S3. The reason

why the experimental ratios R can be described with a function of the logistic type lies in the Canberra distance, which behaves very similarly to a logistic function.

$$R = \frac{|\min(I')| - |\max(I')|}{|\min(I')| + |\max(I')|} \approx \operatorname{erf}(p_1 \, a \, \gamma_0) \tag{3}$$

Figure S3: Curve fitting of the experimental data of R, a and γ0 with a two-dimensional error function (R² = 0.9998).

The empirical parameter p1 was estimated by means of the Trust-Region algorithm, and the corresponding result is given by Eq. 4.

$$p_1 = 0.8795 \pm 0.0005 \tag{4}$$

In a next step, the error function (Eq. 3) is inverted to isolate the asymmetry factor, which is shown in Eq. 5.

$$R = \operatorname{erf}(p_1 \, a \, \gamma_0) \;\Rightarrow\; a = \frac{\operatorname{erf}^{-1}(R)}{p_1 \, \gamma_0}, \qquad \text{with: } \operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2} \, dt \tag{5}$$

Evaluation of Estimated Fit Parameters

The developed empirical procedure to estimate the initial fit parameters was tested with 1 × 10^4 simulated vibrational bands. During the initialisation, peak position ν0, peak height A, dynamic peak width γ0 and asymmetry factor a are estimated. In this test, the estimates of A, γ0 and a are plotted against their set values, which can be seen in Fig. S4. All parameters are estimated quite well, and the relative mean squared error (Rel. MSE) ranges between 3.9 and 4.3 %.
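
As an illustration of Eq. 5, the following minimal MATLAB sketch applies the forward model of Eq. 3 to one assumed band and then recovers the asymmetry factor with the inverse error function. Only p1 is taken from Eq. 4; the values of gamma0 and aTrue are illustrative assumptions and not data from this work.

```matlab
% Sketch of Eq. 3 and Eq. 5 for a single, assumed vibrational band.
p1     = 0.8795;   % empirical parameter from Eq. 4
gamma0 = 4;        % assumed dynamic peak width base value (illustrative)
aTrue  = 0.1;      % assumed asymmetry factor (illustrative)

R    = erf(p1 * aTrue * gamma0);    % forward model (Eq. 3)
aEst = erfinv(R) / (p1 * gamma0);   % inverted model (Eq. 5)

fprintf('R = %.4f, estimated a = %.4f\n', R, aEst);
```

For these values the ratio stays within the range |R| ≤ 0.5 that the estimation is restricted to, and the inversion returns the assumed asymmetry factor.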

Figure S4: Evaluation of estimated fit parameters. A), C) and E) compare the estimated values with the theoretical (set) values of the peak parameters A, γ0 and a. All plots indicate a good linear correlation and were fitted by first-order linear regression (fit coefficients: a0 [intercept] and a1 [slope]). The overall accuracy appears to be good, because the regression lines almost coincide with the bisector (intercept = 0 and slope = 1). B), D) and F) illustrate the relative error distributions of the corresponding fit parameter estimations.

In addition to the single-parameter illustration, Fig. S5 shows the overall relative error for all evaluated vibrational bands. According to these data, very small vibrational bands cannot be

estimated as accurately as bigger vibrational bands.

[Figure S5: three-dimensional scatter plot with the axes peak height A, dynamic peak width γ0 and asymmetry factor a; colour scale: relative error in %]

Figure S5: Evaluation of estimated fit parameters. The peak parameters A, γ0 and a are illustrated in a three-dimensional scatter plot with a colour map that describes the fourth dimension, the overall relative error. It can be observed that the error rises significantly if the dynamic peak width γ0 drops to about 2 or below.

Evaluation of the Distance Function d

The main concept of the developed distance function is the consideration of all vibrational band areas, positions and their statistical significance. Therefore, we evaluated an empirical

function, which contains four constants k1 to k4. Here, k1 and k2 define two distance thresholds for the dissimilarity analysis of two different vibrational band positions ν, a lower one (k1: no significant dissimilarity) and an upper one (k2: 95 % dissimilarity). On the other hand, k3 and k4 define the corresponding limits for the dissimilarity analysis of two different normalised vibrational band areas Λ. For the estimation, we analysed several different and similar measured vibrational bands from the environmental microplastics spectra (n = 300), focussing on their vibrational band data distributions, as shown exemplarily in Fig. S6.

Figure S6: Analysis of vibrational band data distributions.

In the presented case, two different vibrational bands at 720 and 730 cm-1 were analysed. The distributions of the detected peak positions have a width of 4 to 5 cm-1, which can be estimated using a 4σ interval. The upper limit is given by the distance between these two vibrational bands, as these bands are clearly separated with a corresponding difference ν95% of about 9 cm-1. Analogously, normalised vibrational band area distributions were analysed, and in the given example, the normalised band area Λ720,730 has an interquartile range of 0.16 or a 3 IQR interval of 0.48, respectively. The upper limit Λ95% was set to 1.75. For further estimation, all this information was used to define a system

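of equations, which is given by Eq. 6.

$$
\begin{aligned}
0.05 &= 1 - \left(\exp[k_1 \cdot 5 - k_2] + 1\right)^{-1} \\
0.95 &= 1 - \left(\exp[k_1 \cdot 10 - k_2] + 1\right)^{-1} \\
0.05 &= 1 - \left(\exp[k_3 \cdot 0.5 - k_4] + 1\right)^{-1} \\
0.95 &= 1 - \left(\exp[k_3 \cdot 1.75 - k_4] + 1\right)^{-1}
\end{aligned}
\tag{6}
$$

$$k_1 = 1.2, \quad k_2 = 9, \quad k_3 = 4.7, \quad k_4 = 5.3$$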
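
The system in Eq. 6 can be solved in closed form, because each equation is a logistic function of a linear argument. The following MATLAB sketch is one possible way to do this and is not necessarily the procedure used for this work: it takes the logit of both sides and solves the two resulting 2 × 2 linear systems. The variable names z, kPos and kArea are introduced here for illustration only.

```matlab
% 1 - 1/(exp(u) + 1) = p  is equivalent to  u = log(p/(1 - p)).
z = log(0.95 / 0.05);                    % logit of 0.95; the logit of 0.05 is -z

% Band positions: 5*k1 - k2 = -z and 10*k1 - k2 = +z
kPos  = [5 -1; 10 -1] \ [-z; z];         % [k1; k2]

% Normalised band areas: 0.5*k3 - k4 = -z and 1.75*k3 - k4 = +z
kArea = [0.5 -1; 1.75 -1] \ [-z; z];     % [k3; k4]

fprintf('k1 = %.2f, k2 = %.2f, k3 = %.2f, k4 = %.2f\n', kPos, kArea);
```

The solution is approximately k1 = 1.18, k2 = 8.83, k3 = 4.71 and k4 = 5.30, which agrees with the rounded constants given above.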

Microplastics Identification
Internal Data Normalisation Technique

Contents

Automated Data Preprocessing
  Baseline Correction
  Noise Level Characterisation
  Extract Fingerprint
  Detect Peaks
  Initial Fit Parameters
  Curve Fitting
  Output: Positions, Areas and Weightings
Asymmetric Pseudo Voigt Function (1st Derivative)
  Read Function Parameters
  Substitute Parameters
  Function
Dissimilarity Function
  Filter Bands
  Significance Weighting
  Check For Minimum Number Of Peaks
  Normalise Peaks
  Calculate Distances Between Signals

Automated Data Preprocessing

    function z = mident_specfit(x,y)

Baseline Correction

In a first step, the baseline is corrected by differentiation using the Savitzky-Golay algorithm. The polynomial order K is set to K = 2 and the window size F is set to F = 7.

    K = 2; % polynomial order
    F = 7; % window size
    [~, g] = sgolay(K, F); % Savitzky-Golay differentiation filters
    g = g(:, 2); % filter for the first derivative
    dx = mean(diff(x)); % step size
    N = size(x, 1); % number of data points
    HalfWin = ((F+1)/2) - 1;
    SG1 = zeros(N - (F+1)/2, size(x,2));
    for n = HalfWin+1:N-HalfWin-1
        SG1(n, :) = dot(g, y(n-HalfWin:n+HalfWin, :));
    end
    yderiv = SG1./dx; % first-order derivative of y
    xderiv = x(1:size(y, 1)-HalfWin-1, :); % corresponding x vector

Noise Level Characterisation

In order to characterise the noise level of the differentiated spectrum, a histogram analysis is performed. For this purpose, the distribution of all differentiated absorbances is fitted using a Lorentzian model. We define the noise level as the peak width that encloses 68.3 % of the entire peak area.

    binwidth = 3.5*std(yderiv)/size(yderiv,1)^(1/3); % bin width, based on Scott's rule
    nbins = abs(max(yderiv) - min(yderiv))/binwidth; % corresponding number of bins
    xx = linspace(min(yderiv),max(yderiv),nbins); % bin vector (x data)
    yy = interp1(unique(yderiv),1:numel(unique(yderiv)),xx,'linear'); % bin vector (y data)

Lorentzian model:

    myfunc = @(c,a,s,w,x) a./(((x-c).^2.*(exp(-s*(x-c)) + 1).^2)./w.^2 + 1);
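
The definition of the noise level as the width that encloses 68.3 % of the Lorentzian area can be checked numerically. The sketch below is a simplified illustration under stated assumptions: it uses the symmetric case s = 0 of the model above, centred at c = 0, with an assumed amplitude and width parameter. For this case, the full width that encloses a share P of the area is w·tan(P·π/2).

```matlab
% Symmetric case (s = 0, c = 0) of the Lorentzian model: (exp(0) + 1)^2 = 4.
a = 1;                                        % assumed amplitude (illustrative)
w = 1;                                        % assumed width parameter (illustrative)
model = @(x) a ./ (4 * x.^2 ./ w.^2 + 1);

P = 0.6827;                                   % targeted share of the total area
fullWidth = w * tan(P * pi / 2);              % full width that encloses the share P

total    = integral(model, -Inf, Inf);        % entire peak area
enclosed = integral(model, -fullWidth/2, fullWidth/2);
share    = enclosed / total;

fprintf('enclosed share = %.4f\n', share);
```

The enclosed share reproduces P, which corresponds to the factor tan(0.34135·π) applied to the fitted width parameter.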

So far, the bin vectors describe the cumulative distribution. To apply a Lorentzian model, these data have to be differentiated. Additionally, the edges, which contain vibrational band information, have to be cut off, since this procedure characterises the noise level only.

    xx2 = xx(1:end-1)';
    yy2 = diff(yy)'; % differentiate bin vector
    hlp = cumsum(yy2)./sum(yy2); % normalised help vector
    yy2 = yy2(hlp > .05 & hlp < .95); % cut bin vector (y data)
    xx2 = xx2(hlp > .05 & hlp < .95); % cut bin vector (x data)

Fit the noise distribution with the predefined Lorentzian model.

    myfit = fit(xx2,yy2,myfunc,...
        'StartPoint',[0 max(yy2) 0 0],...
        'Upper',[.1 max(yy2)*1.2 1 1],...
        'Lower',[-.1 max(yy2)*.8 -1 0],...
        'Algorithm','Trust-Region');
    noise = myfit.w*tan(0.34135*pi); % as mentioned, noise is defined by the peak width

Extract Fingerprint

Assuming that most characteristic vibrational bands can be obtained from the fingerprint region, this part of the spectrum is extracted. For further data evaluation, limits of 700 cm-1 to 1900 cm-1 are predefined.

    lmts = [700 1900];
    yderiv = yderiv(xderiv > lmts(1) & xderiv < lmts(2)); % extract fingerprint: y data
    xderiv = xderiv(xderiv > lmts(1) & xderiv < lmts(2)); % extract fingerprint: x data

Detect Peaks

In a first step of peak detection, local extrema are determined using the findpeaks function. In order to filter out very small peaks, the noise level is used as minimum peak height.

    [imax,locmax] = findpeaks(yderiv,xderiv,'minpeakheight',noise); % find local maxima
    [imin,locmin] = findpeaks(-yderiv,xderiv,'minpeakheight',noise); % find local minima
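
To illustrate what the two findpeaks calls above return, the following sketch detects the extrema of the first derivative of one synthetic band. It is a simplified stand-in that uses plain neighbour comparisons instead of findpeaks, so it does not reproduce all findpeaks options. The band shape, the wavenumber axis and the noise threshold are assumptions chosen for illustration.

```matlab
x = (700:0.5:760)';                          % assumed wavenumber axis in cm-1
band = 1 ./ (1 + ((x - 730) / 6).^2);        % synthetic Lorentzian band at 730 cm-1
yderiv = gradient(band, x);                  % numerical first derivative
noise = 0.01;                                % assumed noise level

inner = (2:numel(x)-1)';                     % interior points only
d = yderiv(inner);
isMax = d > yderiv(inner-1) & d >= yderiv(inner+1) & d >= noise;
isMin = d < yderiv(inner-1) & d <= yderiv(inner+1) & -d >= noise;

locmax = x(inner(isMax));                    % position of the derivative maximum
locmin = x(inner(isMin));                    % position of the derivative minimum
fprintf('maximum at %.1f cm-1, minimum at %.1f cm-1\n', locmax, locmin);
```

The maximum lies below and the minimum above the band centre, so one band in absorbance mode yields exactly one maximum followed by one minimum in the first derivative.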

In a next step, every local maximum is connected to a local minimum in its close proximity. These pairs define the inflection points of the vibrational bands in the corresponding raw spectrum.

    [x0,idx] = sortrows([locmax ; locmin],1); % sorted vector of all extreme value positions
    intensity = [imax ones(size(imax,1),1) ; imin zeros(size(imin,1),1)]; % vector of all extreme values
    intensity = intensity(idx,:); % sorted vector of all extreme values, based on their positions
    opt = [x0(:,1) intensity]; % sorted matrix, based on extreme value position
    % [position, intensity, type]
    % type:1 -> local maximum
    % type:0 -> local minimum
    opt_clean = [];

The peak matrix opt contains some noisy local extrema that do not describe an inflection point of a significant vibrational band. Those signals are eliminated in the following loop.

    while ~isempty(opt)
        if opt(1,3) == 1
            hlp = find(opt(:,3) == 0, 1); % first following local minimum
        else
            hlp = find(opt(:,3) == 1, 1); % first following local maximum
        end
        if ~isempty(hlp)
            [~,idxopt] = max(opt(1:hlp-1,2)); % strongest extremum of the leading run
            opt_clean = [opt_clean; opt(idxopt,:)];
            opt(1:hlp-1,:) = [];
        else
            [~,idxopt] = max(opt(1:end,2));
            opt_clean = [opt_clean; opt(idxopt,:)];
            opt = [];
        end
    end

In absorbance mode, every vibrational band starts with a local maximum followed by a local minimum. Therefore, it has to be checked whether the cleaned peak matrix starts with a local minimum or ends with a local maximum; such rows are removed.

    if opt_clean(1,3) == 0
        opt_clean(1,:) = [];
    end
    if opt_clean(end,3) == 1
        opt_clean(end,:) = [];
    end

Inflection points are combined into new matrices: [position of maximum, position of minimum] and [intensity of maximum, intensity of minimum].

    inflptspos = [opt_clean(opt_clean(:,3)==1,1) opt_clean(opt_clean(:,3)==0,1)];
    inflptsint = [opt_clean(opt_clean(:,3)==1,2) opt_clean(opt_clean(:,3)==0,2)];
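
The effect of the cleaning loop can be followed on a small example. The sketch below uses a run-based formulation of the same idea: from every run of consecutive extrema of the same type, only the strongest one is kept. The toy matrix and the variable names runEnd and optClean are assumptions made for illustration.

```matlab
% Toy extrema matrix: [position, intensity, type]; type 1 = maximum, 0 = minimum.
opt = [700 0.2 1; 705 0.9 1; 712 0.8 0; 715 0.3 0; 730 0.5 1; 738 0.6 0];
optClean = zeros(0, 3);

while ~isempty(opt)
    % last row of the leading run that shares the type of the first row
    runEnd = find(opt(:,3) ~= opt(1,3), 1) - 1;
    if isempty(runEnd)
        runEnd = size(opt, 1);               % the run extends to the last row
    end
    [~, idx] = max(opt(1:runEnd, 2));        % strongest extremum of this run
    optClean = [optClean; opt(idx, :)];
    opt(1:runEnd, :) = [];
end

disp(optClean)
```

The weaker maximum at 700 and the weaker minimum at 715 are discarded, which leaves two maximum/minimum pairs (705/712 and 730/738) that alternate strictly.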

Initial Fit Parameters

In order to perform curve fitting using the first-order derivative of the asymmetric pseudo-Voigt function, suitable initial parameters have to be estimated.

Vibrational band position: x0

    x0 = zeros(size(inflptspos,1),1);
    for u = 1:size(inflptspos,1)
        tmpx = xderiv(xderiv >= inflptspos(u,1) & xderiv <= inflptspos(u,2));
        [~,idx] = min(abs(yderiv(xderiv >= inflptspos(u,1) & xderiv <= inflptspos(u,2)))); % zero crossing of the derivative
        x0(u) = tmpx(idx);
    end

Vibrational band width: w

    w = (inflptspos(:,2) - inflptspos(:,1))/2;

Vibrational band height: a

    a = max(inflptsint*2.*w,[],2);

Asymmetry factor: s

    R = (inflptsint(:,1) - inflptsint(:,2))./(inflptsint(:,1)+inflptsint(:,2)); % Canberra distance of inflection point intensities
    a1 = 2.007663533725664; % empirical fit parameter
    b1 = 56.245235357193230; % empirical fit parameter
    c1 = 0.030809453318739; % empirical fit parameter
    s = log(a1./(R.*exp(c1.*w) + 1) - 1)./b1; % empirical estimation of s
    s = real(s) - imag(s); % convert complex results (negative log argument) to real values

Shape factor: f

    f = x0*0+.2; % all shape factors are preset to 0.2

Combine all initial parameters in one matrix.

    initial = [x0 a w f s];
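
The height estimate above multiplies the stronger derivative extremum by 2·w. A short numerical check, given here as a sketch under stated assumptions, shows why a factor of 2 is a plausible compromise for a pseudo-Voigt band: the exact factor is √e ≈ 1.65 for a pure Gaussian with σ = w and 8/3 ≈ 2.67 for a pure Lorentzian with γ = √3·w, so that both profiles have their inflection points at ±w. The test profiles and the variable names are illustrative.

```matlab
w = 6;  A = 1;                                % assumed half distance of the inflection points and band height
x = linspace(-60, 60, 120001)';               % fine grid around the band centre

gauss   = A * exp(-x.^2 / (2 * w^2));         % Gaussian with sigma = w
lorentz = A ./ (1 + x.^2 / (3 * w^2));        % Lorentzian with gamma = sqrt(3)*w

dG = gradient(gauss, x);                      % numerical first derivatives
dL = gradient(lorentz, x);

factorG = A / (w * max(dG));                  % factor that recovers A from w and max(I')
factorL = A / (w * max(dL));

fprintf('Gaussian factor = %.3f, Lorentzian factor = %.3f\n', factorG, factorL);
```

Because the preset factor lies between both limits, the estimate is only a starting value; the subsequent curve fitting is allowed to change the height within its bounds.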

Filter peaks based on minimum and maximum peak width and minimum peak height.

    minwidth = 2*mean(diff(xderiv));
    maxwidth = 100;
    minheight = noise*3;
    initial(...
        initial(:,3) < minwidth |...
        initial(:,3) > maxwidth |...
        initial(:,2) < minheight,:) = [];

Set upper and lower limits of the fit parameters.

    up = [initial(:,1) + minwidth,...
        initial(:,2)*10,...
        initial(:,3)*10,...
        ones(size(initial,1),1),...
        ones(size(initial,1),1)*0.2];
    lo = [initial(:,1) - minwidth,...
        initial(:,2)*.1,...
        initial(:,3)*.1,...
        zeros(size(initial,1),1),...
        -ones(size(initial,1),1)*0.2];

Reshape the parameter matrices to parameter row vectors.

    initial = reshape(initial',1,[]); % band-wise order: x0, a, w, f, s of each band
    up = reshape(up',1,[]);
    lo = reshape(lo',1,[]);

Curve Fitting

In a next step, the differentiated data (xderiv, yderiv), derived from the incoming raw data (x, y), are fitted using non-linear regression. The first-order derivative of the asymmetric pseudo-Voigt function is chosen as fit function. Since the data contain not a single but multiple vibrational bands, a cumulative fit function has to be defined.

    n = length(initial)/5; % number of peaks
    FCNprefix = 'ffunc9(x,';
    idxarray = arrayfun(@(u) ['p' sprintf('%03d',u)],1:n,'uni',0); % one prefix per peak
    pararray = {'p1_c';'p2_A';'p3_w';'p4_mu';'p5_a'}; % parameter names of one peak
    vararray = cell(5,n);
    for u = 1:n
        for v = 1:5
            vararray{v,u} = [idxarray{u} pararray{v},','];
        end
    end
    FCNstr = strjoin(reshape(vararray,1,[]));
    FCNstr(end) = []; % remove trailing comma
    FCNstr = [FCNprefix FCNstr ')']; % fit function
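
The order of the parameters in the row vectors matters, because the fitted coefficients are later read with a stride of five. The following sketch shows the ordering for two hypothetical bands, assuming that the parameters are arranged band by band as implied by that stride; all numbers are illustrative.

```matlab
% Two hypothetical bands; columns: [x0 a w f s]
initial = [720 0.10 4 0.2 0.00; 1460 0.25 6 0.2 0.01];

% Transposing before reshaping keeps the five parameters of one band together:
% x0_1 a_1 w_1 f_1 s_1 x0_2 a_2 w_2 f_2 s_2
rowVec = reshape(initial', 1, []);

positions = rowVec(1:5:end);   % every fifth entry, starting at 1: band positions
widths    = rowVec(3:5:end);   % every fifth entry, starting at 3: band widths

disp(rowVec)
```

With this layout, the start points and limits line up with the alphabetically ordered coefficient names (p001..., p002...) of the cumulative fit function.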

Options for curve fitting:

    ft = fittype(FCNstr);
    options = fitoptions(ft);
    options.StartPoint = initial;
    options.Lower = lo;
    options.Upper = up;
    options.Robust = 'Bisquare';
    options.Display = 'off';

Fit

    [FIT.curve, FIT.gof] = fit(xderiv,yderiv,ft,options);

Output: Positions, Areas and Weightings

    FITcoefficients = coeffvalues(FIT.curve); % read fit parameters
    heights = FITcoefficients(2:5:end); % fitted peak heights
    widths = FITcoefficients(3:5:end); % fitted peak widths
    shapes = FITcoefficients(4:5:end); % fitted peak shapes
    asymm = FITcoefficients(5:5:end); % fitted asymmetry factors

Fitted peak positions

    z.positions = FITcoefficients(1:5:end)';

Fitted peak areas, calculated by numerical integration

    z.areas = arrayfun(@(u)...
        trapz(...
        linspace(... % x data for integration
        z.positions(u)-8*widths(u),...
        z.positions(u)+8*widths(u),...
        500),...
        mypeakfunction2(... % y data for integration
        linspace(...
        z.positions(u)-8*widths(u),...
        z.positions(u)+8*widths(u),...
        500),...
        z.positions(u),...
        heights(u),...
        widths(u),...
        shapes(u),...
        asymm(u)...
        )...
        ),...
        1:n)';

Fitted peak weightings

    z.weightings = zeros(n,1);
    FIT.data = ffunc9(xderiv,FITcoefficients); % fitted y data
    for i = 1:n
        tmprange = xderiv > z.positions(i) - 1.64*widths(i) &...
            xderiv < z.positions(i) + 1.64*widths(i); % the factor 1.64 encloses 90% of peak area
        % y1 in the original listing is assumed to denote the differentiated data yderiv
        z.weightings(i) = z.areas(i)/sqrt(sum((FIT.data(tmprange) - yderiv(tmprange)).^2)/...
            sum((yderiv(tmprange) - mean(yderiv(tmprange))).^2));
    end
    end % closing function

Asymmetric Pseudo Voigt Function (1st Derivative)

    function y = ffunc9(x,varargin)

Read Function Parameters

This function uses a variable number of parameters, since every additional peak adds a set of five parameters to the function.

    varargin = cell2mat(varargin);
    c = varargin(1:5:end); % peak centers
    a = varargin(2:5:end); % peak heights
    w = varargin(3:5:end); % peak widths
    f = varargin(4:5:end); % shape factor
    s = varargin(5:5:end); % asymmetry factor
    x = repmat(x,1,size(c,2)); % wavenumbers

Substitute Parameters

In this step, calculations that are performed multiple times are factored out in order to reduce the number of computing operations.

    x = x - c;
    u = exp(s.*x);
    v = u.*x;
    z = w.^2;
    t = u+s.*v+1;

Function

    % signs reconstructed so that y is the analytic first derivative of the band model
    y = sum(...
        -(24.*a.*f.*z.*x.*(u + 1).*t)./(v.^2 + 2.*v.*x + 12.*z + x.^2).^2 ... % Lorentzian
        +(a.*x.*exp(-(x.^2.*(u + 1).^2)./(8.*z)).*(f - 1).*(u + 1).*t)./(4.*z),... % Gaussian
        2);

Dissimilarity Function

    function D = mident_dissimilarity(X1,X2,Y1,Y2,S1,S2)

Filter Bands

The algorithm compares the closest peaks of the two spectra pairwise. A maximum gap size of 20 cm-1 is preset.

    allx = pdist2(X1,X2); % Calculate pairwise distance of peak positions.
    allx(allx > 20) = NaN; % Delete all pairs with a distance greater than 20.
    hlpIDX = sum(~isnan(allx),1) > 1; % Check reference for multiplets.
    allx(allx ~= min(allx,[],1)) = NaN; % Keep only the closest sample peak per reference peak.
    hlpIDX = sum(~isnan(allx),2) > 1; % Check sample for multiplets.
    allx(allx ~= min(allx,[],2)) = NaN; % Keep only the closest reference peak per sample peak.
    hlpIDX1 = nansum(allx,2)==0; % Check sample for unused bands.
    hlpIDX2 = nansum(allx,1)==0; % Check reference for unused bands.
    X1(hlpIDX1) = []; % Delete all unused sample peak positions.
    X2(hlpIDX2) = []; % Delete all unused reference peak positions.
    Y1(hlpIDX1) = []; % Delete all unused sample peak areas.
    Y2(hlpIDX2) = []; % Delete all unused reference peak areas.
    S1(hlpIDX1) = []; % Delete all unused sample peak significance values.
    S2all = S2; % Save reference significance vector.
    S2(hlpIDX2) = []; % Delete all unused reference peak significance values.

Significance Weighting

The weighting factor W is defined as the cumulated significance S2 of the reference peak areas Y2 that can be found in the sample spectrum Y1 as well, relative to the cumulated significance of all reference peaks.

    W = sum(S2)^2/sum(S2all)^2;

Check For Minimum Number Of Peaks

The dissimilarity function only supports spectra with three or more peaks. If this is not the case, the dissimilarity is set to its maximum.

    if numel(X1) > 2 && numel(X2) > 2

Normalise Peaks

In this section, all common sample and reference peaks are normalised. For that purpose, all possible and unique peak pairs are combined using the Canberra distance.

        ry1 = pdist(Y1,@(u,v) (u-v)./(u+v))'; % Normalise sample peaks.
        ry2 = pdist(Y2,@(u,v) (u-v)./(u+v))'; % Normalise reference peaks.
        rx1a = pdist(X1,@(u,v) u)'; % Create sample peak position vector 1 (first peak of each pair).
        rx1b = pdist(X1,@(u,v) v)'; % Create sample peak position vector 2 (second peak of each pair).
        rx2a = pdist(X2,@(u,v) u)'; % Create reference peak position vector 1.
        rx2b = pdist(X2,@(u,v) v)'; % Create reference peak position vector 2.
        rs1a = pdist(S1,@(u,v) u)'; % Create sample peak significance vector 1.
        rs1b = pdist(S1,@(u,v) v)'; % Create sample peak significance vector 2.
        rs2a = pdist(S2,@(u,v) u)'; % Create reference peak significance vector 1.
        rs2b = pdist(S2,@(u,v) v)'; % Create reference peak significance vector 2.
        rs1 = min([rs1a rs1b],[],2); % Calculate sample peak significance vector.
        rs2 = min([rs2a rs2b],[],2); % Calculate reference peak significance vector.

Calculate Distances Between Signals

The distance or dissimilarity includes peak position and area differences. In addition, a weighting vector is used, which corresponds to the peak significances.

        dx = sqrt((rx1a - rx2a).^2 + (rx1b - rx2b).^2); % Normalised peak position difference.
        dy = sqrt((ry1 - ry2).^2); % Normalised peak area difference.
        dw = mean([rs1 rs2],2); % Weighting vector.
        k1 = 1.2; k2 = 9; k3 = 4.7; k4 = 5.3; % Empirical factors.
        dz = 1 - 1./((exp(k1*dx - k2)+1).*(exp(k3*dy - k4)+1)); % Normalised peak distance.
        D = sum(dz.*dw/sum(dw))^W; % Calculate weighted dissimilarity.
    else % If sample and reference have no common vibrational band
        D = 1;
    end
    end % closing function
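
The behaviour of the normalised peak distance can be illustrated with a few hand-picked differences. The following sketch evaluates the expression for dz from the listing at the threshold values used in Eq. 6 and then forms a weighted dissimilarity. The difference vectors, the equal weights dw and the value W = 1 are assumptions for illustration and do not come from measured spectra.

```matlab
k1 = 1.2; k2 = 9; k3 = 4.7; k4 = 5.3;        % empirical factors from the listing
dzfun = @(dx, dy) 1 - 1 ./ ((exp(k1*dx - k2) + 1) .* (exp(k3*dy - k4) + 1));

dx = [0 5 10 0   0   ];                      % assumed position differences
dy = [0 0 0  0.5 1.75];                      % assumed area differences
dz = dzfun(dx, dy);

dw = ones(size(dz));                         % assumed equal significance weights
W  = 1;                                      % assumed: all reference significance is matched
D  = sum(dz .* dw / sum(dw))^W;              % weighted dissimilarity

fprintf('%.3f ', dz); fprintf('\nD = %.3f\n', D);
```

Identical pairs give a distance close to 0, the lower thresholds (5 cm-1 or an area difference of 0.5) give about 0.05, and the upper thresholds (10 cm-1 or 1.75) give about 0.95, as intended by Eq. 6.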