
Ss. CYRIL AND METHODIUS UNIVERSITY
FACULTY OF NATURAL SCIENCES AND MATHEMATICS, SKOPJE
INSTITUTE OF CHEMISTRY

Igor Kuzmanovski

CHEMOMETRIC ANALYSIS OF URINARY CALCULI

SUMMARY OF THE SUBMITTED Ph.D. THESIS

Republic of Macedonia, Skopje, 2005

Igor Kuzmanovski

CHEMOMETRIC ANALYSIS OF URINARY CALCULI

Abstract

The development of automated methods for determining the composition of urinary calculi is very important since it can facilitate the determination of the factors which influence the occurrence of calculi in the urinary tract and perhaps aid the prevention of their reoccurrence. Infrared spectroscopy, the instrumental technique used throughout this work, is one of the most suitable experimental tools for the analysis of urinary calculi, providing information about the exact chemical individuality of the constituents. As a result of the continuous analysis of urinary calculi in our laboratory in the last nine years, we found that about 80 % of all urinary calculi consisted of calcium oxalates (whewellite and weddellite) and/or their mixtures, as well as of these two substances in mixtures with carbonate apatite, struvite or uric acid. That is the reason why the focus of this work was on the development of methods for analysis of urinary calculi composed of these substances.

Three methods for classification of the above-mentioned four types of calculi were developed using feed-forward artificial neural networks (ANN), supervised self-organizing maps (SOM) and support vector machines (SVM). None of the models developed by these three chemometric techniques was initially capable of correctly classifying all 160 infrared spectra of calculi. That is why genetic algorithms were applied for the selection of the most suitable parameters of all three models as well as for the selection of the most suitable wavenumber regions for classification purposes. Several models developed by supervised SOM and SVM were capable of correctly classifying all the calculi, while the ANN models correctly classified 95.3 % of the samples.

The determination of the quantitative composition of urinary calculi composed of whewellite, weddellite and carbonate apatite was performed using artificial neural networks. The root-mean-square errors for the test samples were … for whewellite, … for weddellite and … for carbonate apatite. The accuracy of the method was checked using the standard-addition method on real samples. The discrepancies between the calculated and predicted mass fractions of the constituents were in the range acceptable for use of the proposed method in clinical laboratories.

Key words: urinary calculi, whewellite, weddellite, uric acid, carbonate apatite, struvite, chemometrics, classification, quantitative analysis, qualitative analysis, principal component analysis, artificial neural networks, self-organizing maps, Kohonen neural networks, support vector machines.

INTRODUCTION

In the last 20 years, a number of computerized methods for the analysis of urinary calculi were developed [1–6]. Most of them are used for determining the qualitative and semiquantitative composition using databases of infrared spectra of known constituents. Among the chemometric methods for quantitative determination of the composition of human urinary calculi are the factor-based methods [7–9] as well as those which use artificial neural networks (ANNs). In some of these methods [9,10], the accuracy of the method was determined by comparing the results with those obtained by visual inspection of a library of spectra. On the other hand, in [11] the standard-addition method was applied with the belief that the results thus obtained are more reliable and free from subjective errors. In fact, it is known [11–14] that the factor-based methods give best results when the relationship between absorbance and mass fraction of the analytes is close to linear, while the ANN methods are capable of modelling non-linearities in the A vs. w(B) relationships. The main advantage of ANNs over the traditional non-linear regression techniques is their ability to generate models without a priori knowledge about the modelling function [20].

The work presented here is a part of the wider structural and analytical examination of calculi and their constituents performed in our laboratory in the last nine years [7,11,15,21–29]. As a result of these studies [15] we found that about 80 % of the calculi in Macedonia are composed of calcium oxalates (whewellite and weddellite) and/or their mixtures, as well as of these two substances in mixtures with carbonate apatite, struvite or uric acid. That is why the main focus of this work is on the development of models for classification of these four types of calculi. Also, a method is proposed here for quantitative analysis of calculi composed of whewellite, weddellite and carbonate apatite.

EXPERIMENTAL

The infrared spectra of all samples (pure substances, prepared mixtures and real samples of urinary calculi) were recorded on a Perkin-Elmer System 2000 Fourier-transform infrared spectrometer in the … cm⁻¹ region with a resolution of 4 cm⁻¹ and a 1 cm⁻¹ sampling interval. The samples were prepared as KBr pellets (2 mg of homogenized sample and 250 mg of spectroscopy-grade KBr). If the maximum value of the absorbance in the recorded spectrum exceeded one, the mass of the sample in the pellet was proportionally reduced in order to achieve the desired maximum value of absorbance.

The optimization of the method was carried out on 179 samples prepared from synthetic whewellite, weddellite, carbonate apatite and struvite [16–18], while the uric acid was a Merck product. The infrared spectra of the synthesized substances were compared with those in the digital database of infrared spectra by Dao and Daudon [19]. The comparison showed that the desired constituents had indeed been prepared and that the infrared spectra were of quality comparable to that in the database. The mixtures used for the optimization of the method were prepared according to the mixture designs presented in Fig. 1. Twelve additional binary mixtures of whewellite and weddellite were also prepared.

Figure 1. Experimental designs for mixtures consisting of: (a) whewellite, weddellite and uric acid; (b) whewellite, weddellite and carbonate apatite; (c) whewellite, weddellite and struvite (A: whewellite; Б: weddellite; В: uric acid, carbonate apatite or struvite).

During this work, more than 200 infrared spectra of different samples of urinary calculi were collected (in the period …). The composition of these samples, whenever possible, was determined using target factor analysis. The composition of all the other samples was determined by comparing different spectral regions of the samples with the spectra stored in the database of Dao and Daudon [19], followed by visual inspection.

RESULTS AND DISCUSSION

Application of artificial neural networks for classification of urinary calculi

Preprocessing. Before the optimisation of the ANNs started, proper preprocessing of the data was applied. The spectra of the mixtures prepared for training the networks as well as the spectra of the urinary calculi were stored in a single data matrix (D). The rows in the data matrix correspond to the infrared spectra of different samples, and the columns to different wavenumbers.

The spectra were offset-corrected and normalized to unit area. In order to make the further calculations faster,* the columns of D were reduced according to the equation:

$d^{r}_{i,m} = \frac{1}{10}\sum_{j=(m-1)\cdot 10 + 1}^{m \cdot 10} d_{i,j}$   (1)

where d_{i,j} are the elements of the normalized data matrix (i is the sample number and j indexes the absorbance at a given wavenumber), while d^{r}_{i,m} is the m-th element for the i-th sample in the reduced data matrix. After that, the variables in the reduced data matrix were mean-centred and autoscaled. The autoscaled data matrix was used during the optimization of the models that use the genetic algorithms for variable selection.

In order to extract as much information as possible from as few data points as possible and to make the training process faster, principal component analysis (PCA) was applied. The calculated principal components were used for training of the ANNs.

The output data were stored in another data matrix. The composition of the samples was expressed using a unit vector of length four: whewellite and weddellite samples were labelled with (1 0 0 0); whewellite, weddellite and uric acid samples were labelled as (0 1 0 0); whewellite, weddellite and struvite samples were represented with (0 0 1 0), and whewellite, weddellite and carbonate apatite samples were labelled as (0 0 0 1). Prior to training, this data matrix was scaled in the interval from -1 to 1.

A three-layered feed-forward neural network with a sigmoid transfer function in the hidden layer and a linear transfer function in the output layer was used. The network architecture with best performances was searched for by changing the number of input neurons (principal components) and the number of hidden neurons. The number of output neurons was fixed to four; each output neuron serves as an indicator for a different type of calculi. The weights and the biases of the networks were initialised according to the Nguyen-Widrow algorithm, and they were adjusted using the Levenberg-Marquardt algorithm [20] for back-propagation of error. In order to avoid overtraining, which could produce networks with poor generalisation abilities, the early stopping procedure was applied [30]. This procedure requires a division of the data into three sets: a training, a validation and a test set.

* The optimization of the ANNs and the self-organizing maps was faster when the networks were trained with principal components calculated from the infrared spectra instead of the spectra themselves. However, during the optimization of the networks with genetic algorithms, the slowest step was the calculation of the PCs. In order to make this step faster, we reduced the width of the data matrix by a factor of ten (Eq. 1).
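For illustration, the column reduction of Eq. (1) amounts to averaging blocks of ten adjacent absorbances; a minimal numpy sketch, assuming the spectra are stored one per row (the matrix size and the function name are illustrative, not from the thesis):

```python
import numpy as np

def reduce_columns(D, block=10):
    """Average every `block` adjacent wavenumber points, as in Eq. (1).

    D holds one spectrum per row; trailing columns that do not fill a
    complete block are dropped.
    """
    n_samples, n_points = D.shape
    n_blocks = n_points // block
    trimmed = D[:, :n_blocks * block]
    # Reshape so each block of adjacent absorbances can be averaged.
    return trimmed.reshape(n_samples, n_blocks, block).mean(axis=2)

# Example with random stand-in data: 179 spectra, 3600 points -> 360 columns.
D = np.random.rand(179, 3600)
D_reduced = reduce_columns(D)
```

Averaging, rather than simply discarding, adjacent points keeps the broad-band information while cutting the matrix width by a factor of ten.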

When early stopping is used, the weights and biases of the networks are adjusted using the training set. The validation set serves for monitoring the performances during the training process. In the beginning of the training process, the error in the validation set decreases together with the error in the training set. However, when the network starts to overfit the data, the error in the validation set starts increasing. This behaviour was used to control the generalisation abilities of the networks. The network training was stopped when all the samples were correctly classified or if the number of misclassified samples in the validation set increased in ten consecutive iteration cycles. In the latter case, the weights and biases that correspond to the minimum of the validation error were restored. The test set serves to check the performances of the trained network.

Genetic algorithms [31–37] were used for variable selection as well as for the search for the optimal network architecture. The percentage variances captured by each principal component obtained by the decomposition of the reduced data matrix are presented in Table 1. It can be seen that 15 principal components carry … % of the variance in the data matrix. These principal components are usually used to determine the optimal number of input neurons [26]. However, since the genes in our genetic algorithms are represented using binary strings, so that the number of possible combinations is 2^n (with n a positive integer), we decided to change the number of input neurons from 1 to 16. The optimal number of hidden neurons was searched for in the same range. Together with the absorbances from 100 wavenumber intervals, this leads to chromosomes consisting of one hundred and eight genes (four genes were used to determine the optimal number of hidden neurons and another four to determine the optimal number of input neurons).

The initial population of 80 chromosomes was randomly generated. The performance of each chromosome was obtained as the average number of misclassified samples after twenty-fold cross-validation. The weights and biases were adjusted using a training set consisting of the 179 prepared mixtures, while the 160 real samples were divided into five subsets. Four of the subsets (80 % of the samples) formed the validation set, while the fifth was used as a test set. During five consecutive optimizations of the ANNs, each of the five subsets served once as the test set. After five repetitions of the cross-validation, the order of the samples in the data matrix was randomly changed.

It should be noted that the output produced by a trained ANN is almost never exactly +1 or -1. That is why we used only the sign of the output signals as an indicator variable for comparison of the obtained output with the expected one.
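The sign-based comparison and the early-stopping rule could look roughly as follows; a sketch under stated assumptions (the `net` object with `fit_one_epoch` and `predict` methods is hypothetical, standing in for whatever Levenberg-Marquardt trainer is actually used):

```python
import copy
import numpy as np

def misclassified(outputs, targets):
    # A sample counts as misclassified if any output sign disagrees
    # with its +1/-1 target code (one indicator column per calculi type).
    return int(np.any(np.sign(outputs) != np.sign(targets), axis=1).sum())

def train_early_stopping(net, train, val, max_epochs=500, patience=10):
    # Pattern only: stop when validation misclassifications have not
    # improved for `patience` consecutive epochs; keep the best weights.
    best_err, best_net, bad_epochs = np.inf, None, 0
    for _ in range(max_epochs):
        net.fit_one_epoch(*train)                      # hypothetical API
        err = misclassified(net.predict(val[0]), val[1])
        if err < best_err:
            best_err, best_net, bad_epochs = err, copy.deepcopy(net), 0
        else:
            bad_epochs += 1
        if best_err == 0 or bad_epochs >= patience:
            break
    return best_net if best_net is not None else net
```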

The genetic algorithm was run for 450 generations. During each generation, the sixteen chromosomes (20 % of the whole population) with best performances were kept unchanged. The mating pairs were formed by selecting chromosomes according to the roulette-wheel selection rule. When this selection rule is used, all chromosomes are placed on the wheel; the fraction of the wheel which belongs to a given chromosome is proportional to its performance (the better the performance, the larger the fraction). This technique favours the propagation of genetic material from the chromosomes with better performances compared to the rest of the chromosomes.

Table 1. Percentage variance and cumulative percentage variance captured by the first 16 principal components obtained by decomposition of the data matrix

Number of PCs | Percentage variance | Cumulative percentage variance

The genetic material among the chromosomes was exchanged using the two-point crossover technique. The initial mutation rate was 0.10; it was linearly decreased down to 0.05 at generation 150 and kept at 0.05 thereafter. Because the weights and biases were randomly initialised, and because the validation and test sets were randomly regenerated from the real samples (80 % of the samples in the validation set and 20 % in the test set) before each optimisation, the chromosomes with the best performances did not always show the same performance in successive generations, even when the optimisation was repeated twenty times. The best solutions were obtained after several repetitions of the whole optimization procedure by genetic algorithms.
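The roulette-wheel rule described above can be sketched in a few lines; a minimal sketch, assuming a higher-is-better fitness (the thesis minimizes misclassifications, so a transform such as 1/(1 + errors) is assumed before the wheel is built):

```python
import numpy as np

def roulette_select(fitness, n_pairs, rng=np.random.default_rng(0)):
    """Pick mating pairs with probability proportional to fitness.

    `fitness` must be positive and higher-is-better, e.g. obtained from
    misclassification counts via 1 / (1 + errors).
    """
    p = np.asarray(fitness, dtype=float)
    p = p / p.sum()                  # fraction of the wheel per chromosome
    idx = np.arange(len(p))
    return [tuple(rng.choice(idx, size=2, replace=False, p=p))
            for _ in range(n_pairs)]
```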

The absorbance intervals from the four chromosomes with best performances are presented in Fig. 2. The average percentage of misclassified samples for these solutions, as well as the optimal numbers of input and hidden neurons, are presented in Table 2.

Figure 2. Infrared spectra of the five substances used in this study (a: whewellite; b: weddellite; c: uric acid; d: struvite; e: carbonate apatite) and the four best solutions with wavenumber intervals selected by genetic algorithms.

The average percentage of misclassified samples was calculated using the test set after two hundred-fold cross-validation. Note that the overall misclassification varies from 5.3 to 5.9 %. Examining the performances of the classification of different types of calculi, we note that the best accuracy is obtained for whewellite, weddellite and uric acid (… %), as well as for whewellite, weddellite and struvite (… %). The average percentage of misclassified samples for the whewellite and weddellite type of calculi varies from … to … %, while the percentage of misclassified samples for whewellite, weddellite and carbonate apatite varies in the interval between 5.9 and 6.8 % and is the highest when compared to the other types of calculi.

Table 2. Average percentage of misclassified samples in the test set determined by two hundred-fold cross-validation (ww: whewellite and weddellite; wwu: whewellite, weddellite and uric acid; wwc: whewellite, weddellite and carbonate apatite; wws: whewellite, weddellite and struvite)

Solution | Overall | ww  | wwu | wws | wwc | Input neurons | Hidden neurons
1        | 5,9     | 7,1 | 1,6 | 2,0 | 6,… | …             | …
2        | 5,8     | 4,3 | 1,1 | 2,1 | 8,… | …             | …
3        | 5,3     | 3,8 | 0,8 | 2,7 | 7,… | …             | …
4        | 5,7     | 4,5 | 0,9 | 4,3 | 7,… | …             | …

Classification of urinary calculi by supervised self-organizing maps

Self-organizing maps were initially developed as an algorithm for unsupervised learning [38]. However, in cases where poor class separation is obtained, slight modifications of the algorithm can transform SOMs into a tool for supervised classification [38,39].

In order to make SOMs supervised, the input vectors of the samples in the training set, d_s (in our case the principal components of the corresponding samples), were augmented by a unit vector d_u (Fig. 3a) whose components encode one of the four classes of urinary calculi. In the present study, each 1 in the unit vector was multiplied by the maximal value in the data matrix consisting of the PCs extracted from the training set. During the prediction phase, the part of the weight vectors of the SOM that corresponds to the unit vector is excluded (Fig. 3b). In other words, for each sample in the training set d_s the corresponding d_u must be used during the training, while during the recognition of an unknown sample x only its x_s part is compared with the corresponding part of the weight vectors of the trained SOM.

Figure 3. Illustration of the training (a) and prediction phase (b) for supervised self-organizing maps.

According to the literature, it is recommended that the number of neurons in the map be nearly equal to the number of samples in the training set, and that the length and width of the map be proportional to the magnitudes of the first two eigenvalues obtained by the decomposition of the training set [38]. The ratio of the first two eigenvalues in this case is …. Having that in mind (and also the recommendations [38,40] that the number of map neurons should be similar to the number of samples in the training set), we started the search for the optimal size of the map.
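The augmentation step of Fig. 3a reduces to concatenating each training input with a scaled class-indicator vector; a minimal numpy sketch (function and variable names are illustrative):

```python
import numpy as np

def augment_for_supervised_som(pcs, labels, n_classes=4):
    """Append a scaled class unit vector to each training input (Fig. 3a).

    pcs: (n_samples, n_pcs) principal components; labels: integer class
    index per sample. Each 1 in the unit vector is multiplied by the
    maximal value in `pcs`, as described above.
    """
    scale = pcs.max()
    d_u = np.zeros((pcs.shape[0], n_classes))
    d_u[np.arange(pcs.shape[0]), labels] = scale
    return np.hstack([pcs, d_u])

# At prediction time, only the first n_pcs components of each weight
# vector are compared with the unknown sample (Fig. 3b).
```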

After several trials, we chose a map of size … which was trained using the first six principal components obtained from the mean-centred data matrix. The map had plain boundary conditions, a hexagonal grid, a Gaussian neighbourhood function, and a linearly decreasing learning rate. The weight vectors were initialized along the first two principal components obtained by decomposition of the data matrix [38]. The SOMs were trained using the batch training algorithm [40] in two phases [38]: (1) a rough training phase which lasted 50 epochs, with an initial neighbourhood radius equal to five, a final neighbourhood radius equal to one and a learning rate of 0.5, and (2) a fine training phase which lasted 500 epochs, with an initial and final neighbourhood radius equal to one and a learning rate of 0.1. After the training was finished, the prediction abilities of the SOMs were examined using the data set consisting of the suitably preprocessed infrared spectra of the real samples.

The self-organizing map obtained using the principal components calculated from the full spectrum produced good separation of the samples in the training set, which can be seen from the unified distance matrix presented in Fig. 4. However, using this map, 14 samples were misclassified: one calculus consisting of oxalates was classified as an oxalate/uric acid concrement, a further 12 calculi consisting of oxalates and carbonate apatite were classified as belonging to the oxalate type of calculi, and one calculus consisting of oxalates and carbonate apatite was classified as belonging to the oxalate/struvite type. The distribution of the samples from all four types of calculi on the trained map, together with the misclassified ones, is presented in Fig. 5.

Figure 4. Unified distance matrix for the map trained using principal components obtained from the full spectrum, together with the neurons occupied by the samples of the training set (vv: oxalates; vk: oxalates and carbonate apatite; vs: oxalates and struvite; vm: oxalates and uric acid).
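One epoch of the batch training algorithm with a Gaussian neighbourhood can be written directly in numpy; this is a generic illustration of the update rule, not the SOM toolbox code used in the thesis:

```python
import numpy as np

def batch_som_epoch(W, X, grid, radius):
    """One epoch of the batch SOM algorithm: each weight vector becomes a
    neighbourhood-weighted mean of the samples mapped near it.

    W: (n_neurons, dim) weights; X: (n_samples, dim) data;
    grid: (n_neurons, 2) neuron coordinates on the map.
    """
    # Best-matching unit (closest weight vector) for every sample.
    bmu = np.argmin(((X[:, None, :] - W[None, :, :]) ** 2).sum(-1), axis=1)
    # Gaussian neighbourhood between every neuron and every sample's BMU.
    d2 = ((grid[:, None, :] - grid[bmu][None, :, :]) ** 2).sum(-1)
    h = np.exp(-d2 / (2 * radius ** 2))        # (n_neurons, n_samples)
    return (h @ X) / h.sum(axis=1, keepdims=True)
```

Shrinking `radius` from five to one between epochs reproduces the rough phase; keeping it at one reproduces the fine phase described above.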

The relatively high number of misclassified samples was the reason why we decided to use genetic algorithms (prior to the extraction of the principal components) for variable selection as well as for finding the most suitable map size and training parameters.

Figure 5. The distribution of the samples from all four types of calculi on the trained map (a: whewellite, weddellite and uric acid samples; b: whewellite and weddellite samples; c: whewellite, weddellite and struvite samples; d: whewellite, weddellite and carbonate apatite samples).

Genetic algorithms

In order to obtain as good results as possible, genetic algorithms were applied for the wavenumber selection as well as for selecting the most suitable training parameters and map size. The theory and use of genetic algorithms as a variable selection tool has been reported several times in the chemistry literature [31–37], so only the procedure used in this work is explained here.

An initial population of eighty chromosomes was randomly generated. Each chromosome was represented as a binary vector of 126 genes. The first 100 genes represented absorbances at different wavenumbers: the presence of the corresponding wavenumber interval was coded with 1 and its absence with 0. The remaining genes were used for:

- selection of the most suitable number of principal components (PCs) used for training of the SOMs: four genes (from 1 up to 16 PCs);
- determining the map size: eight genes (four for the length and four for the width), each dimension varied in the interval from 4 to 19;
- determining the optimal number of iteration cycles for the rough training phase: six genes; this parameter was searched in the interval between 1 and 64;
- finding the optimal number of iteration cycles for the fine training phase: eight genes; the number of training cycles in this phase was varied in the interval between 1 and 256, increased by twice the number of training cycles in the rough training phase.

The number of misclassified samples obtained by the SOM trained with the parameters determined by each chromosome was used as a measure of its fitness. After calculating the fitness of the whole population, the sixteen chromosomes (20 % of the total population) with the best performances were selected (in what follows, these chromosomes are referred to as studs). The studs were kept unchanged in each successive generation until a different chromosome produced better performances, in which case the better chromosome would replace the stud. New offspring chromosomes were then created by the two-point crossover technique: two random positions between 1 and 126 were chosen, and the genes in the parent chromosomes between these two positions were exchanged to form new chromosomes. After that, the chromosomes were mutated in order to prevent the genetic algorithm from converging too fast in the search space.

The procedure for variable selection using genetic algorithms was repeated several times for six hundred generations, with a mutation rate of 0.10 in the initial population linearly decreasing down to 0.05 until generation 300 and kept at 0.05 thereafter. After a few repetitions of the optimization process, several solutions without misclassifications were obtained. The wavenumber regions for some of these chromosomes, together with the infrared spectra of the pure substances, are presented in Fig. 6. The map sizes and the training parameters for these same chromosomes are presented in Table 3. The self-organizing map for solution #3 (presented in Table 3), together with the distribution of all 160 samples on it, is presented in Fig. 7.
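The chromosome layout in the list above, together with two-point crossover and mutation, might be decoded as follows; a sketch assuming the gene order shown and a plain binary encoding of the integers (the actual encoding used in the thesis may differ):

```python
import numpy as np
rng = np.random.default_rng(7)

def decode(chrom):
    """Split a 126-gene chromosome (0/1 integer array) into its parts."""
    mask = chrom[:100].astype(bool)                           # wavenumbers
    n_pcs  = 1 + int(''.join(map(str, chrom[100:104])), 2)    # 1..16 PCs
    length = 4 + int(''.join(map(str, chrom[104:108])), 2)    # 4..19
    width  = 4 + int(''.join(map(str, chrom[108:112])), 2)    # 4..19
    rough  = 1 + int(''.join(map(str, chrom[112:118])), 2)    # 1..64
    fine   = 1 + int(''.join(map(str, chrom[118:126])), 2) + 2 * rough
    return mask, n_pcs, length, width, rough, fine

def two_point_crossover(a, b):
    # Exchange the gene segment between two random positions.
    i, j = sorted(rng.choice(len(a), size=2, replace=False))
    a2, b2 = a.copy(), b.copy()
    a2[i:j], b2[i:j] = b[i:j], a[i:j]
    return a2, b2

def mutate(chrom, rate):
    # Flip each gene independently with probability `rate`.
    flip = rng.random(len(chrom)) < rate
    out = chrom.copy()
    out[flip] = 1 - out[flip]
    return out
```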

Figure 6. Selected wavenumber regions for some of the best solutions obtained using genetic algorithms.

Table 3. Map sizes and training parameters for some of the solutions obtained using the genetic algorithms

No. | Principal components | Length | Width | Rough training phase | Fine training phase

Figure 7. Trained map for solution #3 (presented in Table 3), together with the distribution of all 160 samples of urinary calculi.

Classification of uroliths by support vector machines

In the last several years, support vector machines (SVMs) have been gaining more and more popularity among chemometricians around the world [42–56]. SVMs are relatively new machine-learning techniques derived from statistical learning theory [57]. Compared to traditional neural networks, SVMs possess the following advantages: (1) a strong theoretical background; (2) high generalization abilities; (3) the solution is always unique; (4) SVMs do not need to have the network topology determined in advance; (5) SVMs build the solution on a small subset of the training samples, which reduces the computational time necessary for the optimization. A root cause of the growing attention for SVMs is that they adopt the structural risk minimization principle, which has been shown to be superior to the empirical risk minimization principle applied by neural networks [58].

The support vector machines were originally developed as a tool for binary classification, and the development of SVMs for classification of objects into more than two classes is still an active area of research in the field of machine learning. Nevertheless, some approaches are used for such purposes, and one of the simplest and most often used approaches (known as one-against-the-rest or one-against-all) was applied for the analysis of our data.

In this work, the support vector machines were used for classification of urinary calculi, and their performances were compared to those of the back-propagation ANNs and the supervised SOMs. Before the search for the optimal models started, the data were mean-centred. The optimization of an SVM requires:

- finding the most suitable value of the penalty parameter (C); this parameter makes the trade-off between maximizing the margin (among the classes) and minimizing the number of misclassified samples in the training set;†
- the selection of a suitable kernel function and its parameters.

In our work a Gaussian radial basis function was used as the kernel function:

$K(\mathbf{x}, \mathbf{x}_{j}) = \exp\!\left(-\frac{\lVert \mathbf{x} - \mathbf{x}_{j} \rVert^{2}}{2\sigma^{2}}\right)$

A sufficiently small width (σ)* concentrates the Gaussian kernels in the close neighbourhood of the training points, which produces a large number of support vectors in the model and lowers the generalization performance. Too large a value makes the Gaussians of support vectors belonging to different classes overlap, which means a smaller margin (and poorer generalization performance). Guided by previously published results where SVMs were used to solve certain chemical problems, we decided to change these two parameters in the following intervals: Gaussian width: …; penalty parameter: ….

* Often 1/(2σ²) is replaced by γ. We use the same approach here.
† When the values of C are small, most of the samples in the training set are support vectors. In such cases the optimization is computationally inefficient and the models have poor generalization abilities. Larger values of this parameter produce models with fewer support vectors and smaller margins between the classes. For very large values, the algorithm will overfit the training data.
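For concreteness, the Gaussian radial basis kernel with the γ = 1/(2σ²) substitution from the footnote; a minimal sketch:

```python
import numpy as np

def rbf_kernel(x, xj, gamma):
    """Gaussian RBF kernel K(x, x_j) = exp(-gamma * ||x - x_j||^2),
    with gamma = 1 / (2 * sigma**2)."""
    diff = np.asarray(x) - np.asarray(xj)
    return np.exp(-gamma * np.dot(diff, diff))

# Small sigma (large gamma): kernels concentrate around the training
# points, many support vectors, weaker generalization. Large sigma
# (small gamma): overlapping Gaussians and a smaller margin.
```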

The models were trained using (1) the infrared spectra of the samples and (2) the principal components of the suitably preprocessed infrared spectra. The performances of the developed models are presented in Fig. 8.

Figure 8. Performances of different SVM models trained with (a) infrared spectra and (b) six principal components extracted from the infrared spectra.

The best models obtained for the SVMs trained with IR spectra were capable of correctly classifying 88.8 % of the real samples, while the number of support vectors (SVs) for these models varied from 57 to 65. Probably owing to the ability of principal component analysis to concentrate the most common variations in the spectra into the most important PCs, better performances were obtained with the SVM models trained with PCs. Here the decision hyperplane is also defined by fewer support vectors (43 and 44 SVs), which means that these models are less complex. The fraction of correctly classified spectra of urinary calculi in this case is 95.0 %. The number of SVs as a function of C and γ (= 1/(2σ²)) is presented in Fig. 8.

It is important to note that the reduction of the size of the data set with PCA did not produce any difference in the speed of the optimization of the models. However, as a consequence of the removal of the information specific to individual samples contained in the less important PCs, the support vector machines trained with the extracted PCs showed better performances, and their decision hyperplane was defined by fewer SVs, which means that they should have better generalization abilities.*

* A polynomial kernel function was also used, but for some combinations of the penalty parameter and the kernel parameters the algorithm did not converge, so we decided to present here only the results of the SVMs trained with the Gaussian radial basis function.
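A one-against-the-rest SVM on PCA scores can be set up along these lines with scikit-learn; the data are random stand-ins, and the C and γ values are illustrative, not the optimized parameters from this work:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Stand-in data: 179 training spectra (100 points each), four class labels.
X = np.random.rand(179, 100)
y = np.random.randint(0, 4, size=179)

# Extract six PCs, then fit one binary RBF-SVM per class (one-vs-rest).
pcs = PCA(n_components=6).fit_transform(X)
clf = OneVsRestClassifier(SVC(kernel='rbf', C=16, gamma=0.01)).fit(pcs, y)

# Total number of support vectors across the four binary machines.
print(sum(est.n_support_.sum() for est in clf.estimators_))
```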

Optimization of the SVMs with genetic algorithms

Where the performance depends on only two parameters, a suitably performed grid search should be enough to find the best model. However, in this case genetic algorithms were also used for the selection of the wavenumbers most suitable for classification purposes. For this purpose an initial population of 300 chromosomes of length 126 was formed. The first 100 genes were used for the selection of the wavenumbers from the spectra most suitable for classification purposes. The next 11 genes were used for the selection of the parameter C (in the interval between 2 and 128). Another 11 genes were used for the selection of the kernel parameter γ (in the interval from 2⁻⁵ to 2⁻¹²). When PCs were used for training of the SVMs, four more genes were used for the selection of the number of principal components. Both the IR spectra and the corresponding PCs were used for training.

The optimization procedure was performed for 600 generations. In each generation, the 20 % of the chromosomes with best performances were kept unchanged; these chromosomes were used for the creation of 240 offspring chromosomes (by two-point crossover) in the next generation. The whole optimization procedure was repeated several times; each optimization lasted about 45 minutes. Some of the results for the obtained models are presented in Table 4, while the selected wavenumber intervals for these models are presented in Fig. 9.

Due to the selection of more informative wavenumber regions, all the models presented in Table 4 have fewer support vectors than the models developed using the full IR spectra. We should also underline that the number of PCs for the models presented in this table is relatively high compared to the models developed by back-propagation ANNs and supervised SOMs. Among all the models found by the genetic algorithms, only a few had fewer than nine PCs, but those models had a considerably larger number of SVs, which at the same time means a more complicated decision border and a smaller margin. One can see from Fig. 9 that, since principal component analysis aligns the axes with the directions of maximal variance in the data, the wavenumber regions in Fig. 9b are more concentrated in the band-rich regions when compared to the models obtained using the infrared spectra (Fig. 9a).

Table 4. Some of the SVM models obtained using genetic algorithms for variable selection and optimization

Solution No. | Models optimized using IR spectra: γ, C, SV | Models optimized using PCs calculated from IR spectra: γ, C, SV, No. of PCs

Figure 9. Selected wavenumber regions for some of the best SVM models obtained using genetic algorithms (a: models trained using infrared spectra; b: models trained using PCs extracted from infrared spectra).

Compared to the other two algorithms (ANN and supervised SOM), the SVM is by far the fastest. While the optimization of the ANNs (450 generations, population of 80 chromosomes) took about nine days, and the optimization of the supervised SOMs (600 generations, population of 80 chromosomes) about five days, the optimization of the SVMs (600 generations, population of 300 chromosomes) took only about 45 minutes.

Determination of the composition of human urinary calculi composed of whewellite, weddellite and carbonate apatite using artificial neural networks

All the recorded spectra were exported in ASCII format and stored in one data matrix. The rows in the data matrix corresponded to different samples, and each column to a different wavenumber value. The spectra in the data matrix were offset-corrected (the minimum absorbance value in each spectrum was subtracted from the absorbance value at each wavenumber in the … cm⁻¹ region). The offset-corrected spectra were normalized to unit area.

The data matrix of normalized spectra (D) was used for further analysis. The mass fractions of the constituents of the prepared samples were stored in another data matrix in which each column corresponded to a given sample and each row to a different constituent. Another matrix (D_m), consisting of the mean-centred normalized spectra, was also used in the course of the optimization. The elements of the latter matrix are defined as follows:

$d^{m}_{i,j} = d_{i,j} - \bar{d}_{j}$

where d_{i,j} is an element of D, d^{m}_{i,j} is the corresponding element of D_m and \bar{d}_{j} is the average value of the absorbance at wavenumber j in D.

A three-layered feed-forward neural network with a sigmoid transfer function in the hidden layer and a linear transfer function in the output layer was used. In order to improve the training rate and increase the performances of the ANN, an orthogonal transformation of the input variables (the absorbance values) was applied using principal component analysis [59]. The weights and biases of the networks were initialized according to the Nguyen-Widrow algorithm. A training set was employed for the adjustment of the weights and biases using the Levenberg-Marquardt algorithm [20] for back-propagation of error, with the principal components as input data and the mass fractions of the constituents as target data. The generalization abilities were controlled by the early stopping procedure. In order to apply this procedure, the prepared mixtures were divided into a training, a validation and a test set (see Fig. 10). The network training was stopped when the performance goal of 10⁻² for the root-mean-square error (RMSE) was reached or when the RMSE in the validation set increased in five consecutive iteration cycles. In the latter case, the weights and biases that correspond to the minimum validation error were restored. The predictive power of the optimized networks was compared using the test set consisting of 17 prepared mixtures (Fig. 10). The root-mean-square error was used to estimate the prediction error:

$\mathrm{RMSE} = \left[\sum_{i=1}^{m}\sum_{j=1}^{n}\frac{(\hat{w}_{i,j} - w_{i,j})^{2}}{nm}\right]^{1/2}$

where \hat{w}_{i,j} is the predicted mass fraction of constituent j in the i-th sample, w_{i,j} the calculated mass fraction of constituent j in the i-th sample, m the number of samples, and n the number of constituents in the samples of the test set.
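The RMSE of the formula above, computed over an (m samples × n constituents) block of mass fractions; a minimal numpy sketch:

```python
import numpy as np

def rmse(predicted, calculated):
    """RMSE over all constituents and all test samples.

    predicted, calculated: (m, n) arrays of mass fractions, with one row
    per sample and one column per constituent.
    """
    w_hat = np.asarray(predicted)
    w = np.asarray(calculated)
    m, n = w.shape
    return np.sqrt(((w_hat - w) ** 2).sum() / (n * m))
```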

Figure 10. Division of the mixtures presented in Fig. 1b for the optimization of the neural networks: (A) whewellite; (B) weddellite; (C) carbonate apatite; (…) training set; (…) validation set; (…) test set.

Neural network optimization

The process of finding the network architecture giving the best performances is a delicate and time-consuming task. The optimization of the ANNs presented in this work includes finding the optimal numbers of hidden and input neurons. The training of each network architecture was repeated 20 times in order to minimize the influence of the initial weights and biases on the performances of the optimized ANN.

Number of principal components. The data matrices consisting of the normalized infrared spectra (D) and of the mean-centred normalized spectra (D_m) were decomposed using principal component analysis. A total of 1000 principal components (PCs) was calculated from both matrices. The PCs which collectively carry 100.0 % of the variance were used for the optimization of the different network architectures. In the case of the decomposition of D, the first 12 PCs collectively capture 100.0 % of the variance in the data; in the case of the decomposition of D_m, 100.0 % of the variance is carried by 15 PCs. The percentage variance (PV) as well as the cumulative percentage variance (CPV) captured by these PCs are presented in Table 6. It can be noticed that in both cases (D and D_m) most of the variance (more than 93 % of the cumulative variance) is captured by the first two PCs. This is probably so because, when using spectra normalized to unit area for a three-component mixture (where, as in our case, the compositions of the mixtures are expressed using mass fractions), only two parts of the normalized area are independent, while the part of the area which corresponds to the third component always has to complement the area to unity.
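The PV and CPV columns of Table 6 below follow directly from the explained-variance ratios of a PCA decomposition; a sketch with random stand-in data in place of the spectra:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the matrix of normalized spectra (sizes illustrative).
D = np.random.rand(60, 1000)

pv = PCA(n_components=15).fit(D).explained_variance_ratio_ * 100
cpv = np.cumsum(pv)
for k, (p, c) in enumerate(zip(pv, cpv), start=1):
    print(f"{k:2d}  PV = {p:6.2f} %   CPV = {c:6.2f} %")
```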

Table 6. Percentage variance (PV) and cumulative percentage variance (CPV) captured by each principal component

No. of PCs | Normalized data matrix (PV, CPV) | Normalized and mean-centred data matrix (PV, CPV)
1  | 86,19 |  86,19 | 67,82 |  67,82
2  |  7,18 |  93,37 | 25,48 |  93,30
3  |  4,45 |  97,82 |  4,48 |  97,78
4  |  2,06 |  99,88 |  1,40 |  99,18
5  |  0,04 |  99,92 |  0,38 |  99,56
6  |  0,03 |  99,95 |  0,20 |  99,76
7  |  0,02 |  99,97 |  0,11 |  99,86
8  |  0,01 |  99,98 |  0,05 |  99,92
9  |  0,00 |  99,99 |  0,03 |  99,95
10 |  0,00 |  99,99 |  0,02 |  99,97
11 |  0,00 |  99,99 |  0,01 |  99,98
12 |  0,00 | 100,00 |  0,01 |  99,99
13 |  0,00 | 100,00 |  0,00 |  99,99
14 |  0,00 | 100,00 |  0,00 |  99,99
15 |  0,00 | 100,00 |  0,00 | 100,00

The mean values of the calculated RMSE for the test set as a function of the number of input neurons are presented in Fig. 11. The figure shows that the average performance of the ANNs increases (the RMSE values sharply decrease) as the number of input neurons increases, reaching a minimum RMSE when (for both the D and D_m data matrices) three PCs are used as input data. However, after the minimum RMSE is reached, the average network performance starts to decrease.

Figure 11. Mean RMSE as a function of the number of principal components obtained from the normalized (…) and the normalized and mean-centred data matrix (…).

Number of hidden neurons. The number of neurons in the hidden layer defines the complexity of the developed model. If the number of neurons in the hidden layer is too small, the ANN will not be able to model the data accurately. If, on the other hand, the number of neurons in the hidden layer is too large, the performances of the ANN will deteriorate.

The influence of the number of hidden neurons on the performances of different network architectures is presented in Fig. 12. It shows that the mean value of the RMSE decreases in the beginning and, after reaching a minimum (at three hidden neurons for the ANNs trained with PCs obtained from the data matrix of normalized spectra and at two hidden neurons for the ANNs trained with PCs calculated from the mean-centred data matrix), begins to increase slightly. The average performances of the network architectures as a function of the number of hidden neurons (Fig. 12) also show that the ANNs trained with PCs calculated from the mean-centred data matrix have the best average performances.

The comparison of the performances of all the network architectures is presented in Fig. 13. The network architectures giving the best average performances for both data matrices used for PCA were found. When the ANNs were trained with PCs calculated from the normalized data matrix, the network giving the best performances is that with three input and three hidden neurons. The average RMSE of this network is … and it is slightly better than the one obtained by the best network architecture (three input neurons and two hidden neurons) found when the mean-centred data were used (RMSE = 0.073). The same trends of the RMSE as a function of the number of hidden neurons and of the principal components used for the optimization could be observed for the network architectures trained with PCs calculated from D. Here, it was found that the network with the best average performances is that trained using three PCs and with three neurons in the hidden layer. The calculated RMSE values for the constituents of the test set are: … for whewellite, … for weddellite and … for carbonate apatite.

Figure 12. Mean RMSE as a function of the number of hidden neurons for the normalized (…) and the normalized and mean-centred data matrix (…).

Figure 13. Mean RMSE for all network architectures (a: obtained from the normalized data matrix; b: from the normalized and mean-centred data matrix).

Analysis of real samples. The network giving the best performances (the one with three input neurons and three hidden neurons) was applied for predicting the composition of urinary calculi consisting of whewellite, weddellite and carbonate apatite. The accuracy of the obtained results was checked using the method of standard additions: an exactly known mass of synthetic whewellite, weddellite or carbonate apatite was added to an exactly known mass of the carefully ground and homogenized calculus. The mass fractions of the constituents in the samples before and after the standard addition were determined by the same ANN. Knowing the predicted composition of the samples without standard additions, the mass of the sample used for the standard addition, as well as the mass of the added standard, the expected mass fractions were calculated. The calculated mass fractions of the constituents with standard addition, the predicted mass fractions for the samples with standard addition and the standard deviations for each of the constituents of the analyzed samples are presented in Table 7. In most cases the absolute values of the differences between the predicted and the calculated mass fractions of the constituents of the samples are smaller than …. The standard deviations of the predicted mass fractions of the constituents vary from … up to … in the case of sample number 5, where the largest discrepancies between the predicted and calculated mass fractions are observed. According to the data presented in the literature, methods whose standard deviations vary between … and … are suitable for determining the composition of uroliths in clinical laboratories [60].
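The expected mass fractions after standard addition follow from a simple mass balance; a minimal sketch (the function name and the example numbers are illustrative, not taken from Table 7):

```python
def expected_after_addition(w_before, m_sample, m_added, added_constituent):
    """Expected mass fractions after adding a known mass of one synthetic
    constituent to a ground, homogenized calculus.

    w_before: dict constituent -> predicted mass fraction before addition;
    m_sample: mass of calculus used; m_added: mass of the added standard.
    """
    total = m_sample + m_added
    # Dilute every constituent by the new total mass...
    w_after = {c: w * m_sample / total for c, w in w_before.items()}
    # ...then credit the added standard to the spiked constituent.
    w_after[added_constituent] += m_added / total
    return w_after

# Example: a 100 mg calculus (40 % whewellite) spiked with 15 mg whewellite.
print(expected_after_addition(
    {'whewellite': 0.40, 'weddellite': 0.35, 'carbonate apatite': 0.25},
    100.0, 15.0, 'whewellite'))
```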

Table 7. Calculated and predicted mass fractions for the components of some of the analyzed calculi (samples 1–9) using the ANN with the best performances

Calculated mass fractions of the constituents after standard addition:
Whewellite:        0,293 0,307 0,301 0,321 0,033 0,368 0,331 0,316 0,310
Weddellite:        0,339 0,537 0,436 0,257 0,650 0,347 0,365 0,536 0,451
Carbonate apatite: 0,368 0,155 0,263 0,421 0,317 0,285 0,305 0,148 0,239

Predicted mass fractions of the constituents after standard addition:
Whewellite:        0,301 0,320 0,264 0,316 0,064 0,342 0,339 0,312 0,261
Weddellite:        0,317 0,559 0,358 0,258 0,510 0,307 0,364 0,513 0,455
Carbonate apatite: 0,382 0,121 0,378 0,425 0,427 0,351 0,297 0,175 0,284

Differences between the calculated and predicted mass fractions after standard addition:
Whewellite:        -0,008 -0,012  0,037  0,005 -0,031  0,026 -0,008  0,005  0,049
Weddellite:         0,022 -0,022  0,078 -0,001  0,140  0,040  0,001  0,022 -0,004
Carbonate apatite: -0,014  0,034 -0,115 -0,004 -0,109 -0,066  0,007 -0,027 -0,045

Standard deviations for the predicted mass fractions after standard addition:
Whewellite:        0,050 0,021 0,057 0,065 0,072 0,034 0,059 0,056 0,022
Weddellite:        0,029 0,016 0,036 0,036 0,052 0,019 0,039 0,042 0,019
Carbonate apatite: 0,046 0,017 0,051 0,057 0,072 0,035 0,050 0,040 0,027

REFERENCES

1. S.H. Kandil, T.A. Abou El Azm, A.M. Gad, M.M. Abdou, Comput. Enhanced Spectrosc., 3 (1986).
2. M. Berthelot, G. Cornu, M. Daudon, M. Helbert, C. Laurence, Clin. Chem., 33 (1987).
3. C.A. Lehmann, G.L. McClure, I. Smolens, Clin. Chim. Acta, 173 (1988).
4. A. Hesse, M. Gergeleit, P. Schüller, K. Möller, J. Clin. Chem. Clin. Biochem., 27 (1989).
5. G. Rebentisch, M. Doll, J. Muche, Lab. Med., 16 (1992).
6. E. Peuchant, X. Heches, D. Sess, M. Clerc, Clin. Chim. Acta, 205 (1992).
7. I. Kuzmanovski, M. Trpkovska, B. Šoptrajanov, V. Stefov, Vib. Spectrosc., 19 (1999).
8. H. Hobert, K. Meyer, Fresenius J. Anal. Chem., 334 (1992).
9. M. Volmer, A. Block, B.G. Wolthers, A.J. de Ruiter, D.A. Doornbos, W. van der Slik, Clin. Chem., 39 (1993).
10. M. Volmer, A. Block, H.J. Metting, T.H.Y. de Haan, P.M.J. Coenegracht, W. van der Slik, Clin. Chem., 40 (1994).
11. I. Kuzmanovski, Z. Zografski, M. Trpkovska, B. Šoptrajanov, V. Stefov, Fresenius J. Anal. Chem., 370 (2001).
12. C. Borggaard, H.H. Thodberg, Anal. Chem., 64 (1992).
13. E.V. Thomas, D.M. Haaland, Anal. Chem., 62 (1990).
14. K.R. Beebe, B.R. Kowalski, Anal. Chem., 59 (1987) 1007A.
15. I. Kuzmanovski, M. Trpkovska, B. Šoptrajanov, Maked. Med. Pregled, 53 (1999).
16. P. Brown, D. Ackermann, B. Finlayson, J. Cryst. Growth, 98 (1989).
17. D. Ackermann, P. Brown, B. Finlayson, Urol. Res., 16 (1988).
18. M. Santos, P.F. González-Díaz, Inorg. Chem., 16 (1977).
19. N.Q. Dao, M. Daudon, Infrared and Raman Spectra of Calculi, Elsevier, Paris, 1997.

20. M.T. Hagan, M. Menhaj, IEEE Transactions on Neural Networks, 1994.
21. B. Šoptrajanov, G. Jovanovski, V. Stefov, I. Kuzmanovski, Phosphorus, Sulfur and Silicon, 111 (1996).
22. B. Šoptrajanov, G. Jovanovski, I. Kuzmanovski, V. Stefov, Spectrosc. Lett., 31 (1998).
23. B. Šoptrajanov, G. Jovanovski, I. Kuzmanovski, V. Stefov, J. Mol. Struct., 103 (1999).
24. B. Šoptrajanov, I. Kuzmanovski, V. Stefov, G. Jovanovski, Spectrosc. Lett., 32 (1999).
25. B. Šoptrajanov, V. Stefov, I. Kuzmanovski, G. Jovanovski, H.D. Lutz, B. Engelen, J. Mol. Struct., 613 (2002).
26. I. Kuzmanovski, M. Trpkovska, B. Šoptrajanov, V. Stefov, Anal. Chim. Acta, 491 (2003).
27. V. Stefov, B. Šoptrajanov, F. Spirovski, I. Kuzmanovski, H.D. Lutz, B. Engelen, J. Mol. Struct., 689 (2004).
28. I. Kuzmanovski, M. Trpkovska, B. Šoptrajanov, J. Mol. Struct., in press.
29. I. Kuzmanovski, M. Ristova, B. Šoptrajanov, V. Stefov, V. Popovski, Talanta, 62 (2004).
30. H. Demuth, M. Beale, Neural Network Toolbox, Mathworks, Natick.
31. D. Jouan-Rimbaud, D.L. Massart, R. Leardi, O.E. de Noord, Anal. Chem., 67 (1995).
32. R. Leardi, A. Lupiáñez Gonzáles, Chemometr. Intell. Lab. Syst., 41 (1998).
33. K. Hasegawa, Y. Miyashita, K. Funatsu, J. Chem. Inf. Comput. Sci., 37 (1997).
34. B.M. Smith, P.J. Gemperline, Anal. Chim. Acta, 423 (2000).
35. H. Handels, T. Roß, J. Kreusch, H.H. Wolff, S.J. Pöppl, Artif. Intell. Med., 16 (1999).
36. S.S. So, M. Karplus, J. Med. Chem., 39 (1996).
37. H. Yoshida, R. Leardi, K. Funatsu, K. Varmuza, Anal. Chim. Acta, 446 (2001).
38. T. Kohonen, Self-Organizing Maps, 3rd Edition, Springer, Berlin, 2001.
39. T. Kohonen, Computer, 21 (1988).
40. J. Zupan, J. Gasteiger, Neural Networks in Chemistry and Drug Design, Wiley-VCH, Weinheim.
41. W.-P. Tai, in F. Fogelman-Soulié, P. Gallinari (eds.), Proc. ICANN'95, Int. Conf. on Artificial Neural Networks, Vol. II, EC2, Nanterre, France, 1995, p. II.
42. R. Burbidge, M. Trotter, B. Buxton, S. Holden, Comput. Chem., 26 (2001).
43. Y. Cai, X. Liu, X. Hu, K. Chou, Comput. Chem., 26 (2001).
44. A.I. Belousov, S.A. Verzakov, J. von Frese, Chemometr. Intell. Lab. Syst., 64 (2002).
45. M. Song, C.M. Breneman, J. Bi, N. Sukumar, K.P. Bennett, S. Cramer, N. Tugcu, J. Chem. Inf. Comput. Sci., 42 (2002).
46. A.I. Belousov, S.A. Verzakov, J. von Frese, J. Chemom., 16 (2002).
47. M.W.B. Trotter, S.B. Holden, Quant. Struct.-Act. Relat., 22 (2003).
48. H.X. Liu, R.S. Zhang, F. Luan, X.J. Yao, M.C. Liu, Z.D. Hu, B.T. Fan, J. Chem. Inf. Comput. Sci., 43 (2003).
49. V.V. Zernov, K.V. Balakin, A.A. Ivaschenko, N.P. Savchuk, I.V. Pletnev, J. Chem. Inf. Comput. Sci., 43 (2003) 2048.

50. P. Lind, T. Maltseva, J. Chem. Inf. Comput. Sci., 43 (2003).
51. E. Byvatov, U. Fechner, J. Sadowski, G. Schneider, J. Chem. Inf. Comput. Sci., 43 (2003).
52. M.J. Sorich, J.O. Miners, R.A. McKinnon, D.A. Winkler, F.R. Burden, P.A. Smith, J. Chem. Inf. Comput. Sci., 43 (2003).
53. S.R. Amendolia, G. Cossu, M.L. Ganadu, B. Golosio, G.L. Masala, G.M. Mura, Chemometr. Intell. Lab. Syst., 69 (2003).
54. H.X. Liu, R.S. Zhang, X.J. Yao, M.C. Liu, Z.D. Hu, B.T. Fan, J. Chem. Inf. Comput. Sci., 44 (2004).
55. T.C. Martin, J. Moecks, A. Belousov, S. Cawthraw, B. Dolenko, M. Eiden, J. von Frese, W. Köhler, J. Schmitt, R. Somorjai, T. Udelhoven, S. Verzakov, W. Petrich, Analyst, 129 (2004).
56. E. Byvatov, G. Schneider, J. Chem. Inf. Comput. Sci., 44 (2004).
57. C. Cortes, V. Vapnik, Machine Learning, 20 (1995).
58. S.R. Gunn, M. Brown, K.M. Bossley, Lecture Notes Comput. Sci., 1280 (1997).
59. P.J. Gemperline, J.R. Long, V.G. Gregoriou, Anal. Chem., 63 (1991).
60. A. Hesse, G. Sanders, R. Döring, J. Oelichmann, Fresenius J. Anal. Chem., 330 (1988) 372.

Published papers on which this thesis is based

1. Igor Kuzmanovski, Mira Trpkovska, Bojan Šoptrajanov, Viktor Stefov, DETERMINATION OF THE COMPOSITION OF HUMAN URINARY CALCULI COMPOSED OF WHEWELLITE, WEDDELLITE AND CARBONATE APATITE USING ARTIFICIAL NEURAL NETWORKS, Anal. Chim. Acta, 491 (2003).
2. Igor Kuzmanovski, Mira Trpkovska, Bojan Šoptrajanov, OPTIMIZATION OF SUPERVISED SELF-ORGANIZING MAPS WITH GENETIC ALGORITHMS FOR CLASSIFICATION OF URINARY CALCULI, J. Mol. Struct., (2005).
3. Igor Kuzmanovski, CHEMOMETRICS AND MORE,* Proceedings of the XVIII Congress of Chemists and Technologists of Macedonia, 209, Ohrid, Macedonia.

The author was an invited lecturer at the 14th International Symposium Spectroscopy in Theory and Practice:

1. Igor Kuzmanovski, Ana Madevska-Bogdanova, Mira Trpkovska, MULTICATEGORY SUPPORT VECTOR MACHINES FOR CLASSIFICATION OF HUMAN URINARY CALCULI, 14th International Symposium Spectroscopy in Theory and Practice, 29, Nova Gorica, Slovenia.

* An electronic book for Mathcad with detailed explanations of some of the chemometric methods for data analysis.

27 Analytica Chimica Acta 491 (2003) Determination of the composition of human urinary calculi composed of whewellite, weddellite and carbonate apatite using artificial neural networks Igor Kuzmanovski a,, Mira Trpkovska a, Bojan Šoptrajanov a,b, Viktor Stefov a a Institut za hemija, PMF, Univerzitet Sv. Kiril i Metodij, P.O. Box 162, 1001 Skopje, Macedonia b Makedonska akademija na naukite i umetnostite, 1000 Skopje, Macedonia Received 19 December 2002; received in revised form 10 June 2003; accepted 20 June 2003 Abstract More than half of the analyzed calculi from patients from Macedonia are composed of whewellite, weddellite and carbonate apatite (as single components or in binary or ternary mixtures). In order to develop a simple and satisfactorily reliable method for quantitative analysis of urinary calculi, the possibility was explored to employ artificial neural networks (ANNs) as a tool for such a purpose. By changing the number of input and hidden neurons, a search was made for the three-layered feed-forward ANN which would give the best performance. The root-mean-square errors (RMSE) for the test samples are: for whewellite, for weddellite and for carbonate apatite. The accuracy of the method was checked using standard-addition method on real samples. The discrepancies between calculated and predicted mass fraction of constituents were in the range acceptable for use of the proposed method in clinical laboratories Elsevier B.V. All rights reserved. Keywords: Urinary calculi; Artificial neural network; Infrared spectroscopy; Whewellite; Weddellite; Carbonate apatite 1. Introduction The development of automated methods for the determination of the composition of urinary calculi is very important since it can facilitate the determination of the factors which influence the occurrence of calculi in the urinary tract and perhaps aid the prevention of their reoccurrence. Among the instrumental techniques used for this purpose, the infrared spectroscopy is one of the most suitable ones [1]. Some of the main advantages of this technique are the simplicity of sam- Corresponding author. Tel.: ; fax: address: shigor@iunona.pmf.ukim.edu.mk (I. Kuzmanovski). ple preparation and data collection, the specificity and reliability of the results and the small sample size that makes possible the study of the process of nucleation of the calculi. An additional advantage is the possibility to make a distinction between different hydrates (such as whewellite and weddellite) and to identify rarely appearing substances as constituents [2 5]. In the last 20 years, a number of computerized methods for the analysis of urinary calculi were developed [6 11]. Most of them are used for determining the qualitative and semi-quantitative composition using databases of infrared spectra of known constituents. Among the chemometric methods for quantitative determination of the composition human urinary calculi are the factor-based methods [12 14] as well as those which use the artificial neural networks /$ see front matter 2003 Elsevier B.V. All rights reserved. doi: /s (03)

28 212 I. Kuzmanovski et al. / Analytica Chimica Acta 491 (2003) (ANNs). In some of these methods [14,15], the accuracy of the method was determined by comparing the results with those obtained by visual inspection of a library of spectra. On the other hand, in [16] the standard-addition method was applied with the belief that the results thus obtained are more reliable and free from subjective errors. In fact, it is known [16 19] that the factor-based methods give best results when the relationship between absorbance and mass fraction of the analytes is close to linear, while the ANN methods are capable to model the non-linearities in the A w(b) relationships. The main advantages of ANNs over the traditional non-linear regression techniques is the fact that they are able to generate models without a priory knowledge about the modelling function [20]. Since ANNs and their application in chemistry have been extensively discussed in the chemistry literature [21,22], only the procedure applied for their optimization will be discussed here. The recent statistical examination of the uroliths extracted from patients from Macedonia shows that about 54% of the analyzed calculi were composed of pure whewellite, pure weddellite or pure carbonate apatite or of binary and ternary mixtures of these substances [23]. In view of these findings, our goal was to develop a reliable method for quantitative analysis of such calculi, leaving the more complicated (and, as mentioned, less frequent) cases for further work. Absorbance a b c Wavenum e r/cm -1 Fig. 1. Infrared spectra of (a) whewellite; (b) weddellite and (c) carbonate apatite in the cm 1 region. pared with those in the digital database of infrared spectra by Dao and Daudon [27]. The comparison showed that the desired constituents have indeed been prepared and that the infrared spectra are of quality comparable to that in the database. The optimization of the method was carried out on 58 samples prepared from synthetic whewellite, weddelite and carbonate apatite [21 23]. The mixtures used for the optimization of the method was prepared according to mixture design presented on Fig Experimental The infrared spectra of all samples (pure substances, prepared mixtures and real samples of urinary calculi) were recorded on a Perkin-Elmer System 2000 Fourier-transform infrared spectrophotometer in the cm 1 region with resolution of 4 and 1cm 1 sampling interval. The samples were prepared as KBr pellets (2 mg of homogenized sample and 250 mg spectroscopy-grade KBr). If the maximum value of the absorbance in the recorded spectrum exceeded one, the mass of the sample in the pellet was proportionally reduced in order to achieve the desired maximum value of absorbance. The infrared spectra of the synthesized substances 1 (Fig. 1) were com- 1 The syntheses were carried out according to methods found in the literature [24 26]. C(100 %) A(100 %) B(100 %) Fig. 2. Mixture design for the mixtures used for optimization of the neural networks: (A) whewellite; (B) weddellite; (C) carbonate apatite; ( ) training set; ( ) validation set; ( ) test set.

29 I. Kuzmanovski et al. / Analytica Chimica Acta 491 (2003) Among the calculi consisting of whewellite, weddellite and carbonate apatite which were analyzed in our laboratory [23], those with mass of more than 100 mg were chosen. To each ground and homogenized sample, an exactly known mass (between 10.0 and 20.0 mg) of synthetic whewellite, weddelite and/or carbonate apatite was added. The infrared spectra of the samples were recorded before and after the standard addition Data analysis All the recorded spectra were exported in ASCII format and stored in one data matrix. The rows in the data matrix corresponded to different samples, and each column to a different wavenumber value. The spectra in the data matrix were offset corrected (the minimum absorbance value in each spectrum was subtracted from the absorbance value at each wavenumber in the cm 1 region). The offset-corrected spectra were normalized to unit area. The data matrix of normalized spectra (D) was used for further analysis. The mass fractions of the constituents of the prepared samples were stored in another data matrix in which each column corresponded to a given sample and each row to a different constituent. Another matrix (D m ), consisting of mean-centered normalized spectra was also used in the course of the optimization. The elements of the latter matrix are defined as follows: d m i,j = d i,j d j, where d i,j is element of D, d m i,j is the corresponding element of Dm and d j is the average value of the absorbance at wavenumber j in D. A three-layered feed-forward neural network (Fig. 3) with a sigmoid transfer function in the hidden layer and a linear transfer function in the output layer was used. In order to improve the training rate and increase the performances of the ANN, an orthogonal transformation of the input variables (the absorbance values) was applied using the principal-component analysis [28]. The weights and biases of the networks were initialized according to Nguyen Widrow algorithm. A training set was employed for weights and biases adjustment using the Levenberg Marquardt algorithm [29] for back-propagation of error with the principal-components as input data and the mass fractions of constituents as target data. The generalization abilities of the ANNs during the training process were monitored using a validation set, knowing that the validation error normally decreases during the training process. However, when the network begins to overtrain the data in the training set, the error in the validation set begins to increase. This finding could be used for controlling the generalization abilities of the optimized ANNs. The network training was stopped when the performance goal of 10 2 for the root-mean-square errors (RMSE) was reached or when in five consecutive iteration cycles the RMSE in the validation set increased. In the latter case, the weights and biases that correspond to the minimum validation error were restored. The predictive power of the optimized networks was compared using the test set consisting of 17 prepared Input data Target data Bias Bias Fig. 3. Feed-forward artificial neural network with one input, one hidden and output layer.

The root-mean-square error was used to estimate the prediction error:

$$ \mathrm{RMSE} = \left[ \frac{1}{nm} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( \hat{w}_{i,j} - w_{i,j} \right)^{2} \right]^{1/2} \qquad (1) $$

where $\hat{w}_{i,j}$ is the predicted mass fraction of constituent j in the i-th sample, $w_{i,j}$ the calculated mass fraction of constituent j in the i-th sample, m the number of samples, and n the number of constituents in the samples of the test set.

The data processing was carried out using the MATLAB software package [30] and its Neural Network Toolbox [31].

3. Results and discussion

3.1. Neural network optimization

The process of finding the network architecture giving the best performance is a delicate and time-consuming task. The optimization of the ANNs presented in this work includes finding the optimal numbers of hidden and input neurons. The training of each network architecture was repeated 20 times in order to minimize the influence of the initial weights and biases on the performance of the optimized ANN.

3.2. Number of principal components

The data matrices consisting of normalized infrared spectra (D) and of mean-centered normalized spectra (Dm) were decomposed using principal-component analysis. A total of 1000 principal components (PCs) was calculated from both matrices. Auto-scaling (also known as standardization) is not recommended for the analysis of spectroscopic data, since it unduly inflates the noise in baseline regions [32], and was not used in the present work. The PCs which collectively carry 100.0% of the variance were used for the optimization of the different network architectures. In the case of the decomposition of D, the first 12 PCs collectively capture 100.0% of the variance in the data; in the case of the decomposition of Dm, 100.0% of the variance was carried by 15 PCs.

Table 1. Percentage variance (PV) and cumulative percentage variance (CPV) captured by each principal component, for the normalized data matrix and for the normalized and mean-centered data matrix.

The percentage variance (PV) as well as the cumulative percentage variance (CPV) captured by these PCs are presented in Table 1. It can be noticed that in both cases (D and Dm) most of the variance (more than 93% of the cumulative variance) is captured by the first two PCs. This is probably so because, when using spectra normalized to unit area of a three-component mixture (where, as in our case, the composition of the mixtures is expressed using mass fractions), only two parts of the normalized area are independent, while the part of the area which corresponds to the third component always has to complement the area to unity.

The mean values of the calculated RMSE for the test set as a function of the number of input neurons are presented in Fig. 4. (These mean values were calculated from all network architectures, i.e. different numbers of neurons in the hidden layer, and from the RMSE values obtained by the 20 repetitions of the training of the different network architectures.) The figure shows that the average performance of the ANNs increases (the RMSE values sharply decrease) as the number of input neurons increases, reaching a minimum RMSE when (for both the D and Dm data matrices) three PCs were used as input data. However, after the minimum RMSE is reached, the average network performance starts to decrease.
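A direct MATLAB transcription of Eq. (1), with small illustrative matrices in place of the real test-set predictions:

% W: reference ("calculated") mass fractions, What: network predictions,
% both m samples x n constituents; the numbers are made up for illustration
W    = [0.50 0.30 0.20; 0.10 0.60 0.30];
What = [0.48 0.33 0.19; 0.12 0.57 0.31];
[m, n] = size(W);
rmse = sqrt(sum(sum((What - W).^2)) / (n*m));   % Eq. (1)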

3.3. Number of hidden neurons

The number of neurons in the hidden layer defines the complexity of the developed model. If the number of neurons in the hidden layer is too small, the ANN will not be able to model the data accurately. If, on the other hand, the number of neurons in the hidden layer is too large, the performance of the ANN will deteriorate [33].

Fig. 4. Mean RMSE as a function of the number of principal components, obtained from the normalized and from the normalized and mean-centered data matrix.

Fig. 5. Mean RMSE as a function of the number of hidden neurons, for the normalized and for the normalized and mean-centered data matrix.

Fig. 6. Mean RMSE for all network architectures (obtained from (a) the normalized data matrix and (b) the normalized and mean-centered data matrix).
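The architecture search itself reduces to a nested loop over the numbers of input and hidden neurons with repeated restarts. A sketch is given below; trainEvaluateANN is a hypothetical helper (not a toolbox function and not code from the paper) that would train one feed-forward network with early stopping on the validation set and return the test-set RMSE, and is stubbed here only so that the loop is runnable.

% stub standing in for training one feed-forward ANN and returning its test RMSE
trainEvaluateANN = @(nPC, nHidden) 0.07 + 0.01*rand;

nRepeats = 20;                      % repetitions to average out random initial weights/biases
meanRMSE = zeros(12, 10);           % up to 12 input neurons (PCs), up to 10 hidden neurons
for nPC = 1:12
    for nHidden = 1:10
        r = zeros(nRepeats, 1);
        for k = 1:nRepeats
            r(k) = trainEvaluateANN(nPC, nHidden);
        end
        meanRMSE(nPC, nHidden) = mean(r);   % average performance of this architecture
    end
end
[bestRMSE, idx] = min(meanRMSE(:));
[bestPC, bestHidden] = ind2sub(size(meanRMSE), idx);   % best (input, hidden) combination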

Table 2. Calculated and predicted mass fractions of the constituents of some of the analyzed calculi, obtained using the ANN with the best performances. For each sample the table lists, for whewellite, weddellite and carbonate apatite: the calculated mass fractions of the constituents after standard addition, the predicted mass fractions after standard addition, the differences between the calculated and predicted mass fractions, and the standard deviations of the predicted mass fractions.

The influence of the number of hidden neurons on the performance of the different network architectures is presented in Fig. 5. It shows that the mean values of the RMSE decrease at first and, after reaching a minimum (at three hidden neurons for the ANNs whose PCs were obtained from the data matrix of normalized spectra, and at two hidden neurons for the ANNs trained with PCs calculated from the mean-centered data matrix), begin to increase slightly. Fig. 5 also shows that the ANNs trained with PCs calculated from the mean-centered data matrix have the best average performance as a function of the number of hidden neurons.

The comparison of the performances of all the network architectures is presented in Fig. 6, from which the network architectures giving the best average performance for both data matrices used for PCA were found. When the ANNs were trained with PCs calculated from the normalized data matrix, the network giving the best performance is the one with three input and three hidden neurons. The average RMSE of this network is … and it is slightly better than the one obtained by the best network architecture (with three input neurons and two hidden neurons) found when the mean-centered data were used (RMSE = 0.073). The same trends of the RMSE as a function of the number of hidden neurons and of the number of principal components were observed for the network architectures trained with PCs calculated from D. Here, it was found that the network with the best average performance is the one trained using three PCs and with three neurons in the hidden layer. The calculated RMSE values for the constituents of the test set are: … for whewellite, … for weddellite and … for carbonate apatite.

3.4. Analysis of real samples

The network giving the best performance (the one with three input neurons and three hidden neurons) was applied for predicting the composition of urinary calculi consisting of whewellite, weddellite and carbonate apatite. The accuracy of the obtained results was checked using the method of standard additions, adding an exactly known mass of synthetic whewellite, weddellite or carbonate apatite to an exactly known mass of the carefully ground and homogenized calculus. The mass fractions of the constituents in the samples before and after the standard addition were determined by the same ANN. Knowing the predicted composition of the samples without standard additions, the mass of the sample used for the standard addition as well as the mass of the added standard, the expected mass fractions were calculated (see the worked expression below). The calculated mass fractions of the constituents with standard addition, the predicted mass fractions for the samples with standard addition and the standard deviations for each of the constituents of the analyzed samples are presented in Table 2.
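The expected mass fractions after standard addition can be written explicitly (this expression is reconstructed from the description above, not quoted from the paper): if a ground calculus of mass $m_s$ with predicted mass fraction $w_k$ of constituent $k$ receives an addition of mass $m_a$ of pure constituent $k$, then

$$ w_k^{\mathrm{exp}} = \frac{m_s w_k + m_a}{m_s + m_a}, \qquad w_j^{\mathrm{exp}} = \frac{m_s w_j}{m_s + m_a} \quad (j \neq k). $$

For example, a 100 mg calculus with $w_k = 0.50$ receiving 20 mg of pure constituent $k$ is expected to give $w_k^{\mathrm{exp}} = (50 + 20)/120 \approx 0.58$.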
In most cases the absolute values of the differences between the predicted and the calculated mass fractions of the constituents of the samples are smaller than 0.10. The standard deviations of the predicted mass fractions of the constituents vary from … up to 0.072 in the case of sample number 5, where the larger discrepancies between the predicted and calculated mass fractions are observed. According to the data presented in the literature, methods whose standard deviations vary between … and … are suitable for determining the composition of uroliths in clinical laboratories [34].

4. Conclusion

The quantitative composition of urinary calculi composed of whewellite, weddellite and carbonate apatite was determined using a feed-forward ANN. In order to make the training procedure faster and to obtain better results, principal-component analysis was applied to the normalized spectra and the PCs were used as input parameters for training the feed-forward networks. The influence of the number of input neurons (principal components) and of the number of hidden neurons on the network performance was investigated. This work shows that networks trained with three input and three hidden neurons predict the constituents of urinary calculi composed of whewellite, weddellite and carbonate apatite with the best average performance. This network architecture was therefore used for the prediction of the composition of real samples of urinary calculi composed of whewellite, weddellite and carbonate apatite.

The absolute values of the differences between the predicted and calculated mass fractions of the constituents (in most of the cases <0.10), as well as their standard deviations (varying from … up to 0.072), show that this method is suitable for predicting the composition of this type of urinary calculi [34].

The procedure described here consists of many data-treatment steps (exporting the spectra in ASCII format, normalization of the spectra, principal-component analysis, optimization of the network and prediction of the mass fractions of the constituents using the trained network), which could make the method complicated for routine use in clinical laboratories. However, the procedure could be relatively easily automated if specialized software is developed, hiding the data treatment behind a suitable graphical user interface and thus making the method simpler to use in clinical laboratories.

References

[1] M. Daudon, R.J. Reveillaud, Presse Med. 16 (1987).
[2] D. Beischer, J. Urol. 73 (1955) 653.
[3] K. Try, Scand. J. Urol. Nephrol. 15 (1981) 263.
[4] M. Daudon, M.F. Protat, R.J. Reveillaud, Ann. Biol. Clin. 36 (1978) 475.
[5] K. Stojanova, I. Petrov, B. Šoptrajanov, Maked. Med. Pregled 24 (1969) 71.
[6] S.H. Kandil, T.A. Abou El Azm, A.M. Gad, M.M. Abdou, Comput. Enhanced Spectrosc. 3 (1986) 171.
[7] M. Berthelot, G. Cornu, M. Daudon, M. Helbert, C. Laurence, Clin. Chem. 33 (1987).
[8] C.A. Lehmann, G.L. McClure, I. Smolens, Clin. Chim. Acta 173 (1988) 107.
[9] A. Hesse, M. Gergeleit, P. Schüller, K. Möller, J. Clin. Chem. Clin. Biochem. 27 (1989) 639.
[10] G. Rebentisch, M. Doll, J. Muche, Lab. Med. 16 (1992) 224.
[11] E. Peuchant, X. Heches, D. Sess, M. Clerc, Clin. Chim. Acta 205 (1992) 19.
[12] I. Kuzmanovski, M. Trpkovska, B. Šoptrajanov, V. Stefov, Vib. Spectrosc. 19 (1999) 249.
[13] H. Hobert, K. Meyer, Fresenius J. Anal. Chem. 334 (1992) 178.
[14] M. Volmer, A. Block, B.G. Wolthers, A.J. de Ruiter, D.A. Doornbos, W. van der Slik, Clin. Chem. 39 (1993) 948.
[15] M. Volmer, A. Block, H.J. Metting, T.H.Y. de Haan, P.M.J. Coenegracht, W. van der Slik, Clin. Chem. 40 (1994).
[16] I. Kuzmanovski, M. Trpkovska, B. Šoptrajanov, V. Stefov, Fresenius J. Anal. Chem. 370 (2001) 919.
[17] C. Borggaard, H.H. Thodberg, Anal. Chem. 64 (1992) 545.
[18] E.V. Thomas, D.M. Haaland, Anal. Chem. 62 (1990).
[19] K.R. Beebe, B.R. Kowalski, Anal. Chem. 59 (1987) 1007A.
[20] A.R. Barron, IEEE Trans. Inform. Theory 39 (1993) 930.
[21] J. Zupan, J. Gasteiger, Neural Networks in Chemistry and Drug Design, Wiley, New York.
[22] A. Bos, M. Bos, W.E. van der Linden, Anal. Chim. Acta 256 (1992) 133.
[23] I. Kuzmanovski, M. Trpkovska, B. Šoptrajanov, Maked. Med. Pregled 53 (1999) 251.
[24] P. Brown, D. Ackermann, B. Finlayson, J. Cryst. Growth 98 (1989).
[25] D. Ackermann, P. Brown, B. Finlayson, Urol. Res. 16 (1988).
[26] M. Santos, P.F. González-Diaz, Inorg. Chem. 16 (1977).
[27] N.Q. Dao, M. Daudon, Infrared and Raman Spectra of Calculi, Elsevier, Paris.
[28] P.J. Gemperline, J.R. Long, V.G. Gregoriou, Anal. Chem. 63 (1991).
[29] M.T. Hagan, M. Menhaj, IEEE Trans. Neural Networks, 1994.
[30] MATLAB 5.2, MathWorks.
[31] H. Demuth, M. Beale, Neural Network Toolbox, MathWorks, Natick.
[32] R. De Maesschalck, F. Estienne, J. Verdú-Andrés, A. Candolfi, V. Centner, F. Despagne, D. Jouan-Rimbaud, B. Walczak, D.L. Massart, S. de Jong, O.E. de Noord, C. Puel, B.M.G. Vandeginste, Int. J. Chem. 2 (1999) 1.
[33] F. Despagne, D.L. Massart, Analyst 123 (1998) 157R–178R.
[34] A. Hesse, G. Sanders, R. Döring, J. Oelichmann, Fresenius J. Anal. Chem. 330 (1988) 372.

Journal of Molecular Structure xx (2005)

Optimization of supervised self-organizing maps with genetic algorithms for classification of urinary calculi

Igor Kuzmanovski a,*, Mira Trpkovska a, Bojan Šoptrajanov a,b

a Institut za hemija, PMF, Univerzitet Sv. Kiril i Metodij, P.O. Box 162, 1001 Skopje, Macedonia
b Makedonska akademija na naukite i umetnostite, 1000 Skopje, Macedonia

Received 20 December 2004; accepted 25 January 2005

Abstract

Supervised self-organizing maps were used for classification of 160 infrared spectra of urinary calculi composed of calcium oxalates (whewellite and weddellite), pure or in binary or ternary mixtures with carbonate apatite, struvite or uric acid. The study was focused on such calculi since more than 80% of the samples analyzed contained some or all of the above-mentioned constituents. The classification was done on the basis of the infrared spectra in the cm⁻¹ region. Two procedures were used to find the most suitable size of, and to optimize, the self-organizing maps; the one using genetic algorithms gave better results. Using this procedure, several sets of solutions with zero misclassifications were obtained. Thus, self-organizing maps may be considered a promising tool for the qualitative analysis of urinary calculi. © 2005 Published by Elsevier B.V.

Keywords: Urinary calculi; Classification; Self-organizing maps; Supervised self-organizing maps; Genetic algorithms; Infrared spectroscopy

1. Introduction

Urolithiasis (the occurrence of urinary calculi) affects from 4 to 20% of the population, depending on the country [1]. The determination of the composition of urinary calculi is therefore important in clinical laboratories, because it can provide information about the development of the calculi, about further treatment of the patients and about the means (e.g. a suitable diet) by which reoccurrence of the urolithiasis could be prevented. Since the infrared and Raman spectra are characteristic for a given compound [1,2], vibrational spectroscopy is one of the few instrumental methods suitable for the analysis of urinary calculi, providing information about the exact chemical individuality of the constituents.

Due to the importance of determining the composition of the calculi, many computerized methods have been developed [3–12]. Most of the methods found in the literature are based on comparison of the sample spectra with a library of spectra of urinary calculi, or on algorithmic schemes where the calculi are classified according to the presence, absence or position of the band maxima, while others are based on principal component analysis [9], target factor analysis [10] and back-propagation neural networks [11,12].

In the last decade self-organizing maps (SOMs) have become a valuable tool for chemometricians [13–22], most often for unsupervised classification purposes [13–20], for process/reaction monitoring [21,22] and as a tool for variable selection [23]. The theoretical background of self-organizing maps and their application in chemistry is described in detail in the literature [24–26].

* Corresponding author. E-mail address: shigor@iunona.pmf.ukim.edu.mk (I. Kuzmanovski).
In this paper, an attempt was made to apply supervised self-organizing maps [26,27] for classification of 160 infrared spectra of four types of urinary calculi consisting of calcium oxalates (whewellite and weddellite) and/or their mixtures as well as of these two substances in mixtures with carbonate apatite, struvite or uric acid. The calculi used in this work have been analyzed in our laboratory since 1996 and these five substances were found in 80% of all analyzed calculi either as single constituents or in binary or ternary mixtures [28].

2. Experimental

The infrared spectra of the samples were recorded (in the cm⁻¹ region) on a Perkin-Elmer System 2000 Fourier-transform infrared spectrometer with a resolution of 4 cm⁻¹ and a sampling interval of 1 cm⁻¹. The samples were prepared as KBr pellets using 2 mg of homogenized sample and 250 mg of spectroscopy-grade KBr. If the maximum value of the absorbance in the recorded spectrum exceeded 1, the mass of the sample in the pellet was proportionally reduced in order to achieve the desired maximum value of the absorbance.

For the training of the supervised self-organizing maps (SOMs), prepared mixtures of whewellite, weddellite, carbonate apatite, struvite and uric acid were used. Whewellite, weddellite, carbonate apatite and struvite were synthesized according to procedures found in the literature [29–31], while anhydrous uric acid was a Merck product. The infrared spectra of these substances were compared with those in the database of infrared spectra by Dao and Daudon [1]. The comparison showed that the desired constituents had indeed been prepared and that the infrared spectra were of quality comparable to those in the database. Some of the samples utilized for training were used in our previous works [11,12], but an additional 69 mixtures of whewellite and weddellite as well as of whewellite, weddellite and struvite were prepared, amounting to a total of 179 mixtures. The infrared spectra of these mixtures were recorded and used for training the self-organizing maps.

The infrared spectra of the samples of urinary calculi were recorded using the same procedure as that described above. Whenever possible (depending on the size of the calculi), infrared spectra were recorded from different calculi layers and target factor analysis was then used for the determination of their qualitative composition [10]. In cases where the sample size did not permit the application of target factor analysis, the composition of the calculi was determined by comparing different regions of the sample spectra with the database by Dao and Daudon [1]. Altogether, 160 infrared spectra of urinary calculi were recorded and used for evaluating the performances of the optimized SOMs. Among these samples, 47 belonged to the whewellite-weddellite type of calculi, 20 samples to whewellite and weddellite in the presence of uric acid, 11 samples consisted of oxalates and struvite, and 82 samples of oxalates and carbonate apatite.

3. Data analysis

3.1. Preprocessing

Prior to training the SOMs, the collected data were preprocessed. The infrared spectra were normalized to unit length and stored in a single data matrix (D). In order to make further calculations faster, the data were reduced from 1000 to 100 absorbance values according to the following equation:

$$ d^{\,r}_{i,m} = \frac{1}{10} \sum_{j=(m-1)\cdot 10 + 1}^{m\cdot 10} d_{i,j} \qquad (1) $$

where $d_{i,j}$ represents the data from the preprocessed matrix, i is the sample number, j denotes the absorbance values at different wavenumbers, and $d^{\,r}_{i,m}$ is the m-th value for the i-th sample in the reduced data matrix. The variables in the reduced data matrix were then autoscaled.
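The reduction of Eq. (1) is a 10-point block average, which in MATLAB can be written with one reshape (a sketch with placeholder data; the column-major reshape trick is an implementation choice, not taken from the paper):

% D: unit-length-normalized spectra, one row per sample, 1000 points per spectrum
D  = rand(179, 1000);                                  % placeholder data for illustration
Dr = squeeze(mean(reshape(D, size(D,1), 10, 100), 2)); % block means over j = (m-1)*10+1 ... m*10
Dr = (Dr - mean(Dr, 1)) ./ std(Dr, 0, 1);              % autoscale the 100 reduced variables
                                                       % (implicit expansion: R2016b or newer)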
In order to extract as much information as possible into as few data points as possible, and to make the training process faster, principal component analysis (PCA) was applied.

3.2. Genetic algorithms

In order to obtain as good results as possible, genetic algorithms were applied for the wavenumber selection as well as for selecting the most suitable training parameters and map size. It should be pointed out that genetic algorithms have proven to be an effective optimization tool [32–34], allowing relatively fast convergence without the need of evaluating every permutation of the variables. In the chemistry literature, the theory and use of genetic algorithms as a variable-selection tool has been reported several times [35–41], so that only the procedure used in this work is explained here.

An initial population of eighty chromosomes was randomly generated. Each chromosome was represented by a binary vector with a length of 126 genes. The first 100 genes in the binary vectors represented the absorbances at different wavenumber values: the presence of the corresponding wavenumber interval was coded with 1 and its absence with 0. The other genes (decoded in the sketch below) were used for:

- selection of the most suitable number of principal components (PCs) used for training of the SOMs: four genes (from 1 up to 16 PCs);
- determination of the map size: eight genes (four genes for the length and four for the width); these parameters were varied in the interval from 4 to 19;
- determination of the optimal number of iteration cycles for the rough training phase: six genes; this parameter was searched in the interval between 1 and 64;
- determination of the optimal number of iteration cycles for the fine training phase: eight genes; the number of training cycles in this phase was varied in the interval between 1 and 256, increased by twice the number of training cycles in the rough training phase.

The number of misclassified samples obtained by the SOM trained with the parameters determined by each chromosome was used as a measure of its fitness.
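A minimal MATLAB sketch of how such a chromosome could be decoded; the gene layout follows the text, while the particular bit-to-integer mapping is an assumption:

chrom = randi([0 1], 1, 126);                % a random chromosome, for illustration

wnMask   = logical(chrom(1:100));            % which of the 100 wavenumber intervals are used
bits2int = @(b) b * 2.^(numel(b)-1:-1:0)';   % binary gene group -> integer

nPC    = 1 + bits2int(chrom(101:104));             % 1 .. 16 principal components
mapLen = 4 + bits2int(chrom(105:108));             % map length, 4 .. 19
mapWid = 4 + bits2int(chrom(109:112));             % map width, 4 .. 19
nRough = 1 + bits2int(chrom(113:118));             % rough training phase, 1 .. 64 cycles
nFine  = 1 + bits2int(chrom(119:126)) + 2*nRough;  % fine phase, 1 .. 256 plus twice nRough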

Fig. 1. Illustration of the training (a) and prediction (b) phases for supervised self-organizing maps.

After calculating the fitness of the whole population, the sixteen chromosomes (20% of the total population) with the best performances were selected (in what follows, these chromosomes are referred to as studs). The studs were kept unchanged in each successive generation until a different chromosome produced better performance; in such a case, the better chromosome would replace the stud. New offspring chromosomes were then created by the two-point crossover technique: two random positions between 1 and 126 were chosen, and the genes in the parent chromosomes between these two positions were exchanged to form new chromosomes. After that, the chromosomes were mutated in order to prevent the genetic algorithm from converging too quickly in the search space. All the calculations were done in a Matlab environment [42] using the Self-Organizing Maps Toolbox by Vesanto [43] and the Genetic Algorithm Toolbox [44].

3.3. Supervised self-organizing maps

Self-organizing maps were initially developed as an algorithm for unsupervised learning, but in cases where poor class separation is obtained, slight modifications of the algorithm can transform SOMs into a tool for supervised classification [27]. In order to make the SOMs supervised, the input vectors for the samples in the training set, d_s (in our case the principal components of the corresponding samples), were augmented by a unit vector d_u (Fig. 1(a)) whose components assign the sample to one of the four classes of urinary calculi. In the present study, each 1 in the unit vector was multiplied by the maximal value in the data matrix consisting of the PCs extracted from the training set. During the prediction phase, the part of the weight vectors of the SOM that corresponds to the unit vector is excluded (Fig. 1(b)). In other words, for each sample in the training set the corresponding d_u must be used during the training, while during the recognition of an unknown sample x only the part corresponding to d_s is compared with the corresponding part of the weight vectors of the trained SOM.
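The augmentation and the prediction step can be sketched in a few lines of MATLAB (placeholder data; the indicator construction and the brute-force best-matching-unit search are assumptions standing in for the SOM Toolbox internals):

Xtr   = rand(179, 6);                    % first six PCs of the 179 training mixtures
class = randi(4, 179, 1);                % class label (1 .. 4) of each training mixture
c     = max(Xtr(:));                     % scaling constant, as described in the text
U     = c * (class == 1:4);              % 179 x 4 scaled class-indicator ("unit") block
Xaug  = [Xtr, U];                        % augmented vectors used for training the SOM

% prediction for an unknown sample x: only the spectral part of the trained
% weight vectors Wsom (one row per map unit) is compared
Wsom = rand(19*10, 10);                  % placeholder for the trained weight vectors
x  = rand(1, 6);
Ws = Wsom(:, 1:6);                       % spectral part of the weight vectors
[~, bmu] = min(sum((Ws - x).^2, 2));     % best-matching unit; its region on the labelled
                                         % map determines the predicted class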

4. Results and discussion

According to the data found in the literature, it is recommended that the number of neurons in the map should be nearly equal to the number of samples in the training set, and that the length and width of the map should be proportional to the magnitudes of the first eigenvalues obtained by the decomposition of the training set [26]. The ratio of the first two calculated eigenvalues in this case is … With that in mind (and also the recommendations [25,26] that the number of map neurons should be similar to the number of samples in the training set), we started the search for the optimal size of the map. After several trials, we chose a map with a size of 19×10, which was trained using the first six principal components obtained from the mean-centered data matrix. The map had plain boundary conditions, a hexagonal grid, a Gaussian neighborhood function and a linearly decreasing learning rate. The weight vectors were initialized along the first two principal components obtained by decomposition of the data matrix [26]. The SOMs were trained using the batch training algorithm [45] in two phases [26]: (1) a rough training phase which lasted 50 iterations, with an initial neighborhood radius equal to five, a final neighborhood radius equal to one and a learning rate of 0.5, and (2) a fine training phase which lasted 500 iteration cycles, with an initial and final neighborhood radius equal to one and a learning rate of 0.1.

Fig. 2. Unified distance matrix for the map trained by use of the principal components obtained from the full spectrum.

Fig. 3. The distribution of the samples from all four types of calculi on the trained map (a: whewellite, weddellite and uric acid samples; b: whewellite and weddellite samples; c: whewellite, weddellite and struvite samples; d: whewellite, weddellite and carbonate apatite samples).
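For orientation, a minimal batch-SOM training loop is sketched below under simplifying assumptions: a rectangular instead of a hexagonal grid, random-sample instead of PCA-based linear initialization, and a single phase with a linearly shrinking radius (the actual computations used the SOM Toolbox [43]).

X = rand(179, 6);                              % training vectors (e.g. the augmented PCs)
mapL = 19; mapW = 10;
[gx, gy] = meshgrid(1:mapW, 1:mapL);
G   = [gx(:), gy(:)];                          % grid coordinates of the 190 map units
Gd2 = sum(G.^2,2) + sum(G.^2,2)' - 2*(G*G');   % squared inter-unit grid distances
W   = X(randi(size(X,1), size(G,1), 1), :);    % initialize weights with random samples

for epoch = 1:50
    sigma = 5 - 4*(epoch - 1)/49;              % neighborhood radius shrinking from 5 to 1
    D2 = sum(X.^2,2) + sum(W.^2,2)' - 2*(X*W');  % squared sample-to-unit distances
    [~, bmu] = min(D2, [], 2);                 % best-matching unit for every sample
    H = exp(-Gd2(bmu, :) / (2*sigma^2));       % Gaussian neighborhood weights
    W = (H' * X) ./ sum(H, 1)';                % batch update of all weight vectors
end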

After the training was finished, the prediction abilities of the SOMs were examined using the data set consisting of the suitably preprocessed infrared spectra of the real samples. The self-organizing map obtained using the principal components calculated from the full spectrum produced good separation of the samples in the training set, which can be seen from the unified distance matrix presented in Fig. 2. However, using this map, 14 samples were misclassified: one calculus consisting of oxalates was classified as an oxalates-uric acid concrement, a further 12 calculi consisting of oxalates and carbonate apatite were classified as belonging to the oxalate type of calculi, and one calculus consisting of oxalates and carbonate apatite was classified as belonging to the oxalates-struvite type. The distribution of the samples from all four types of calculi on the trained map, together with the misclassified ones, is presented in Fig. 3.

The relatively high number of misclassified samples was the reason why we decided to use genetic algorithms (prior to the extraction of the principal components) for the variable selection as well as for finding the most suitable map size and training parameters. The procedure for variable selection using genetic algorithms was repeated several times, each run lasting six hundred generations, with a mutation rate of 0.10 in the initial population decreasing linearly down to 0.05 until generation 300; after that the mutation rate was kept at the 0.05 value. After a few repetitions of the optimization process, several solutions without misclassifications were obtained. The selected wavenumber regions for some of these chromosomes, together with the infrared spectra of the pure substances, are presented in Fig. 4. The map sizes and the training parameters for these same chromosomes are presented in Table 1. The self-organizing map for solution no. 3 (presented in Table 1), together with the distribution of all 160 samples on it, is presented in Fig. 5.

Fig. 4. Selected wavenumber regions for some of the best solutions obtained by use of genetic algorithms.

Table 1. Map sizes and training parameters (number of principal components, map length and width, and numbers of iteration cycles for the rough and fine training phases) for some of the solutions obtained using the genetic algorithms.

Fig. 5. Trained map for solution no. 3 (presented in Table 1), together with the distribution of all 160 samples of urinary calculi.
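The outer genetic-algorithm loop described above can be summarized as follows. somMisclassified is a hypothetical stub for the real fitness evaluation (training a supervised SOM with the decoded parameters and counting the misclassified calculi), and the pairing of parents among the studs is an assumption made only to keep the sketch short:

somMisclassified = @(chrom) sum(chrom(1:5));          % dummy fitness, for illustration only

pop = randi([0 1], 80, 126);                          % initial population of 80 chromosomes
for gen = 1:600
    pMut = max(0.05, 0.10 - 0.05*(gen - 1)/299);      % mutation rate: 0.10 -> 0.05 by gen 300
    fit = zeros(80, 1);
    for i = 1:80, fit(i) = somMisclassified(pop(i,:)); end
    [~, order] = sort(fit);                           % fewer misclassifications = fitter
    studs = pop(order(1:16), :);                      % keep the 16 best chromosomes unchanged
    child = zeros(64, 126);
    for i = 1:64
        p   = studs(randi(16, 1, 2), :);              % two parents drawn from the studs
        cut = sort(randi(126, 1, 2));                 % two-point crossover
        c = p(1,:);  c(cut(1):cut(2)) = p(2, cut(1):cut(2));
        flips = rand(1, 126) < pMut;                  % mutation
        c(flips) = 1 - c(flips);
        child(i,:) = c;
    end
    pop = [studs; child];                             % next generation
end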

5. Conclusions

Genetic algorithms can be successfully used for the optimization of supervised self-organizing maps and for the selection of the most suitable wavenumber regions for the classification of human urinary calculi. It has to be emphasized that genetic algorithms can help in the selection of suitable wavenumber regions without the need of any previous spectroscopic knowledge. Furthermore, the results of the optimization could easily be implemented in a suitable graphical user interface, which could then be of real help for the use of the results presented here in clinical laboratories, a task on which we are presently working in our laboratory.

Acknowledgements

The financial support by the Ministry of Education and Science of the Republic of Macedonia is gratefully acknowledged.

References

[1] N.Q. Dao, M. Daudon, Infrared and Raman Spectra of Calculi, Elsevier, Paris.
[2] M. Daudon, R.J. Reveillaud, Presse Med. 16 (1987) 627.
[3] S.H. Kandil, T.A. Abou El Azm, A.M. Gad, M.M. Abdou, Comput. Enhanced Spectrosc. 3 (1986) 171.
[4] A. Hesse, M. Gergeleit, P. Schüller, K. Möller, J. Clin. Chem. Clin. Biochem. 27 (1989) 639.
[5] M. Berthelot, G. Cornu, M. Daudon, Clin. Chem. 33 (1987).
[6] C.A. Lehmann, G.L. McClure, I. Smolens, Clin. Chim. Acta 173 (1988) 107.
[7] G. Rebentisch, M. Doll, J. Muche, Lab. Med. 16 (1992) 224.
[8] E. Peuchant, X. Heches, D. Sess, M. Clerc, Clin. Chim. Acta 205 (1992) 19.
[9] H. Hobert, K. Meyer, Fresenius J. Anal. Chem. 334 (1992) 178.
[10] I. Kuzmanovski, M. Trpkovska, B. Šoptrajanov, V. Stefov, Vib. Spectrosc. 19 (1999) 249.
[11] I. Kuzmanovski, Z. Zografski, M. Trpkovska, B. Šoptrajanov, V. Stefov, Fresenius J. Anal. Chem. 370 (2001) 919.
[12] I. Kuzmanovski, M. Trpkovska, B. Šoptrajanov, V. Stefov, Anal. Chim. Acta 491 (2003) 211.
[13] P.K. Hopke, X.H. Song, Anal. Chim. Acta 348 (1997) 375.
[14] D. Wienke, Y. Xie, P.K. Hopke, Anal. Chim. Acta 310 (1995) 1.
[15] R. Goodacre, J. Pygall, D.B. Kell, Chemometr. Intell. Lab. Syst. 38 (1997) 1.
[16] J. Zupan, M. Novič, Anal. Chim. Acta 192 (1994) 219.
[17] H. Yang, I.R. Lewis, P.R. Griffiths, Spectrochim. Acta 55 (1999).
[18] Y.V. Heyden, P. Vankeerberghen, M. Novic, J. Zupan, D.L. Massart, Talanta 51 (2000) 455.
[19] I.V. Pletnev, V.V. Zernov, Anal. Chim. Acta 455 (2002) 131.
[20] F. Vandeerestraeten, C. Wojciechowski, N. Dupuy, J.-P. Huvenne, Analusis 26 (1998) 57.
[21] M. Kolehmainen, P. Rönkkö, O. Raatikainen, Anal. Chim. Acta 484 (2003) 93.
[22] C. Ruckebusch, L. Duponchel, J.-P. Huvenne, Anal. Chim. Acta 446 (2001) 257.
[23] R. Todeschini, D. Galvagni, J.L. Vílchez, M. del Olmo, N. Navas, Trends Anal. Chem. 18 (1999) 93.
[24] J. Zupan, M. Novič, I. Ruisánchez, Chemometr. Intell. Lab. Syst. 38 (1997) 1.
[25] J. Zupan, J. Gasteiger, Neural Networks in Chemistry and Drug Design, VCH, Weinheim.
[26] T. Kohonen, Self-Organizing Maps, third ed., Springer, Berlin.
[27] T. Kohonen, Computer 21 (1988) 11.
[28] I. Kuzmanovski, M. Trpkovska, B. Šoptrajanov, Maked. Med. Pregled 53 (1999) 251.
[29] P. Brown, D. Ackermann, B. Finlayson, J. Cryst. Growth 98 (1989) 285.
[30] D. Ackermann, P. Brown, B. Finlayson, Urol. Res. 16 (1988) 219.
[31] M. Santos, P.F. González-Diaz, Inorg. Chem. 16 (1977).
[32] J. Holland, J. Comput. Machinery 3 (1962) 297.
[33] B. Kermani, S. Schiffman, H.T. Nagle, IEEE Trans. Biomed. Eng. 46 (1999) 429.
[34] C. Henderson, W. Potter, R. McClendon, G. Hoogenboom, Appl. Intell. 12 (2000) 183.
[35] D. Jouan-Rimbaud, D.L. Massart, R. Leardi, O.E. de Noord, Anal. Chem. 67 (1995).
[36] R. Leardi, A. Lupiáñez González, Chemometr. Intell. Lab. Syst. 41 (1998) 195.
[37] K. Hasegawa, Y. Miyashita, K. Funatsu, J. Chem. Inf. Comput. Sci. 37 (1997) 306.
[38] B.M. Smith, P.J. Gemperline, Anal. Chim. Acta 423 (2000) 167.
[39] H. Handels, T. Rob, J. Kreusch, H.H. Wolff, S.J. Pöppl, Artif. Intell. Med. 16 (1999) 283.
[40] S.S. So, M. Karplus, J. Med. Chem. 39 (1996).
[41] H. Yoshida, R. Leardi, K. Funatsu, K. Varmuza, Anal. Chim. Acta 446 (2001) 485.
[42] MATLAB 5.2, MathWorks.
[43] J. Vesanto, Intell. Data Anal. 6 (1999) 111.
[44] A. Chipperfield, P. Fleming, H. Pohlheim, C. Fonseca, Genetic Algorithm Toolbox User's Guide, University of Sheffield, Sheffield.
[45] W.-P. Tai, in: F. Fogelman-Soulié, P. Gallinari (Eds.), Proc. ICANN 95, Int. Conf. Artif. Neural Networks, vol. II, EC2, Nanterre, France, 1995, p. II-33.

XVIII Congress of Chemists and Technologists of Macedonia, Ohrid, 2004

CHEMOMETRICS AND MORE (ACE 12)

Igor Kuzmanovski
Institute of Chemistry, Faculty of Natural Sciences and Mathematics, Sts. Cyril and Methodius University, P.O. Box 162, MK-1001 Skopje, Republic of Macedonia

Introduction

An average analytical chemist is well trained to deal with most of the problems that can occur during the use of laboratory instrumentation. They usually do an excellent job adjusting the parameters of their spectrometers in order to achieve as good an atomization of the sample as possible. Those who deal with chromatography usually have excellent experience in adjusting the composition of the mobile phase, the flow rate, etc. in order to achieve the best possible separation of the analytes of interest within the shortest possible duration of the analysis. And all of them have excellent practical skills for determining which pretreatment is most suitable for specific samples. But not many of them are working in the field of analytical chemistry because of their mathematical background, although they are faced with huge amounts of data each and every day. Typically, the statistical education of an analytical chemist includes low-level training in statistics, linear regression, statistical significance tests, and error and mean comparison, all of that without much mathematics in it. When some of them are willing to use more advanced mathematical techniques, they usually hit a beginner's wall owing to the lack of background knowledge in the fields of mathematics and data analysis.

However, it is not necessary to know the exact definitions of the mathematical theorems on which the techniques used in chemometrics are based. The important part is to understand the way they work, which kind of data preprocessing is most suitable and which kind of information can be extracted from the experimental data. For some chemometricians, including the author of this text, the key step in understanding the way different chemometric techniques work is the ability to write a program implementing the specific algorithm in some programming language (Visual Basic, C, Matlab, Mathcad, etc.). At the beginning of my work in the field of chemometrics I had to move from the easy-to-use Mathcad environment, whose user interface looks like writing your code on a sheet of paper, to the less user-friendly environments of Visual Basic, C and Matlab, because at that time there was no chemometric code written for Mathcad. Although the programming package Mathcad is not widely used as a tool by chemometricians, its interface is the most suitable for novices in this field to learn the basics. That is the main reason why this electronic book, called Chemometrics and More, is presented to the chemical community.

Fig. 1. Table of contents of the book.

Content and Description

The book consists of several chapters (Fig. 1): (1) Factor Analysis, (2) Signal Processing, (3) Calibration, (4) Chemical Chaos and the Theory of Self-Organized Criticality, and (5) Miscellaneous.

The chapter on Factor Analysis [1] is the most detailed one. Factor analysis is one of the tools considered a basic topic which should be understood by beginners in chemometrics. The chapter includes a script used for the analysis, by abstract factor analysis, of the factors which influence the grades of students; it is explained thoroughly so that the reader/user is able to understand the subsequent, more complicated versions of factor analysis. A chemical example of the same procedure (purification of reflection infrared spectra from the non-linearities appearing in them) is also explained. Target factor analysis is presented as a tool not only for the determination of the number of factors which influence the variations of the absorbances at different wavenumbers (which is possible by abstract factor analysis), but also for the qualitative analysis of urinary calculi [1,2].

Further, three variations of factor analysis which can be employed in chromatography are presented: evolving factor analysis (EFA), window factor analysis (WFA) and window target factor analysis (WTFA) [3]. These methods use the fact that each existing chemical species has a single unique maximum in its evolutionary profile. Besides chromatography, such profiles can be found in reaction kinetics, titrations, variations in pH and so on [3–6], wherever the instrument response varies as a function of some parameter. EFA is based on repetitive eigenvalue analysis of the set of data matrices generated by the evolutionary process: whenever an absorbing species begins to appear, an eigenvalue from the pool of noise eigenvalues increases in value in relation to its contribution to the enlarged data set. WFA is similar to EFA, but in this case the eigenvalue analysis is performed on a set of submatrices of the original one with a constant number of samples (i.e. with a constant window). WTFA (Fig. 2) is a variation of WFA with target testing employed to find the desired species in the whole data matrix. Examples of how these algorithms work are presented on chromatographic data.

Fig. 2. Results from window target factor analysis on simulated chromatograms (a: first eigenvalue target-tested with the spectra of all six substances present in the chromatogram; b: contour plot of the simulated chromatogram).

The chapter on Calibration is maybe the most important part of the book. It consists of examples of how linear regression should be performed, of the calculation of confidence intervals as well as of the calculation of the limit of detection [7]. A simple script with a program which performs multiple linear regression is included in this part, with an example on the prediction of the unit-cell parameters of cubic perovskites [8,9]. The most important parts of this chapter are the programs developed for principal component regression (PCR) [10,11] and partial least squares (PLS) regression [11–14] (with a cross-validation routine for the selection of the number of latent variables); these basic chemometric tools are here developed for the first time in the Mathcad environment. The examples in this part are presented using data from [15].
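The idea behind forward EFA described above can be illustrated in a few lines of MATLAB (a sketch on simulated placeholder data; in the book the analysis is of course carried out on chromatographic examples):

% forward evolving factor analysis: eigenvalues of growing submatrices reveal
% when a new absorbing species starts to contribute
D  = rand(50, 80);                    % simulated time x wavelength data matrix
ev = nan(50, 5);                      % track the five largest eigenvalues
for k = 2:50
    s  = svd(D(1:k, :));              % singular values of the first k rows
    nk = min(5, numel(s));
    ev(k, 1:nk) = s(1:nk).^2;         % eigenvalues of D(1:k,:)' * D(1:k,:)
end
semilogy(ev); xlabel('time point'); ylabel('eigenvalue');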
In the Signal Processing chapter, a few different algorithms for the smoothing of digitally collected data are presented. The algorithms for moving-average smoothing, moving-median smoothing, fast Fourier transform smoothing and Savitzky-Golay smoothing are developed, and an example of Gaussian kernel smoothing is also presented.
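Two of the listed algorithms in a minimal MATLAB form (the window width of 5 is chosen arbitrarily here):

y = sin(linspace(0, 4*pi, 200)) + 0.1*randn(1, 200);   % noisy test signal
w = 5;  h = (w - 1)/2;
ya = conv(y, ones(1, w)/w, 'same');                    % moving-average smoothing
ym = y;
for i = 1 + h : numel(y) - h
    ym(i) = median(y(i-h : i+h));                      % moving-median smoothing
end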

In the Miscellaneous chapter, the procedure for generalizing the resolution function for the separation of two chromatographic peaks to an arbitrary number of peaks is presented [16]. The procedure could be used for the development of an automated procedure for the determination of the optimum parameters for the chromatographic separation of more than two analytes. This part also contains a procedure for the simulation of square-wave voltammograms of surface redox reactions [17].

The last chapter of this electronic book, Chemical Chaos, is the least chemometric part, but it consists of some interesting theoretical examples of oscillating reactions, some of them very simple and taken from contemporary textbooks of physical chemistry [18], while other theoretical models were developed by some of the leading groups in the world working in the field of chemical chaos [19].

The author hopes that this book will be a valuable tool for all young analytical chemists who are willing to learn some of the basic algorithms in the field of chemometrics and data analysis. The book can be downloaded at: … Your suggestions and comments are welcome; please leave them at: …

Acknowledgement

The author wishes to thank Professor Dr. Slobotka Aleksovska, Goran Stojkovic and Vladimir Ivanovski for permission to use some of their data in the preparation of this electronic book.

References

1. Malinowski E.R., Factor Analysis in Chemistry, 2nd ed., Wiley, New York.
2. Kuzmanovski I., Trpkovska M., Šoptrajanov B., Stefov V., Vib. Spectrosc., 19 (1999).
3. Gemperline P.J., Hamilton J.C., J. Chemometrics, 3 (1989).
4. Gampp H., Maeder M., Meyer C.J., Zuberbuhler A.D., Talanta, 32 (1985).
5. Gampp H., Maeder M., Meyer C.J., Zuberbuhler A.D., Chimia, 39 (1985).
6. Maeder M., Anal. Chem., 59 (1987).
7. Sharaf M.A., Illman D.L., Kowalski B.R., Chemometrics, John Wiley & Sons, New York.
8. Kuzmanovski I., Aleksovska S., Chemometr. Intell. Lab. Syst., 67 (2003).
9. Petruševski V., Aleksovska S., Croat. Chem. Acta, 67 (1994).
10. Brereton R.G., Chemometrics: Data Analysis for the Laboratory and Chemical Plant, John Wiley & Sons, New York.
11. Geladi P., Kowalski B.R., Anal. Chim. Acta, 185 (1986).
12. Wold S., Technometrics, 20 (1978).
13. Wold S., Ruhe A., Dunn W., SIAM J. Sci. Statist. Comput., 5 (1984).
14. Geladi P., Kowalski B.R., Anal. Chim. Acta, 185 (1986).
15. Brereton R.G., Analyst, 125 (2000).
16. Divjak B., Moder M., Zupan J., Anal. Chim. Acta, 358 (1998).
17. Mirceski V., Kuzmanovski I., Bull. Chem. Technol. Macedonia, 18 (1999).
18. Atkins P.W., Fundamentals of Physical Chemistry, Oxford University Press.
19. Scott S.K., Peng B., Tomlin A.S., Showalter K., J. Chem. Phys., 94 (1991).

LABORATOIRE CRISTAL
Centre de Recherches et d'Informations Scientifiques et Techniques Appliquées aux Lithiases
Association Loi de 1901
Service de Biochimie A, Groupe Hospitalier Necker-Enfants-Malades, 149, Rue de Sèvres, PARIS CEDEX 15
Directeur: M. Daudon (m.daudon@infonie.fr)

Professor Icko DORGOSKI
Faculty of Natural Sciences and Mathematics
Sts. Cyril and Methodius University
1000 SKOPJE, MACEDONIA

Paris, June 12, 2005

Dear Professor Dorgoski,

Please find enclosed my comments on the PhD thesis entitled "Chemometric analysis of urinary calculi" submitted by Igor Kuzmanovski. I think that this researcher has performed high-quality work on chemometric procedures and factorial analysis of urinary calculi. I have already read some of his scientific contributions published previously. We are working in the same field and we also use genetic algorithms to make optimal procedures for the automated reading of infrared spectra. The analytical approach developed by Igor Kuzmanovski is an important step toward providing automated analytical tools for the many laboratories able to apply infrared spectroscopy to stone analysis throughout the world.

Yours sincerely,

Michel Daudon

LABORATOIRE CRISTAL
Centre de Recherches et d'Informations Scientifiques et Techniques Appliquées aux Lithiases
Association Loi de 1901
Service de Biochimie A, Groupe Hospitalier Necker-Enfants-Malades, 149, Rue de Sèvres, PARIS CEDEX 15
Directeur: M. Daudon (m.daudon@infonie.fr)

COMMENTS ON THE PhD THESIS SUBMITTED BY IGOR KUZMANOVSKI: "CHEMOMETRIC ANALYSIS OF URINARY CALCULI"

The aim of the thesis was to develop automatic procedures for the quantitative analysis of urinary calculi based on genetic algorithms and artificial neural networks. The stone material included whewellite, weddellite and their mixture, as well as these two components in ternary mixtures with a third component: carbapatite, struvite or uric acid. Thus, all these combinations of stone components probably cover more than 80% of all urinary calculi. Considering that few reports have been published on the quantitative aspect of stone composition, this work is of clinical importance because the proportions of components within a stone may be relevant for etiological purposes. Nevertheless, it must be borne in mind that an increasing number of stones are broken within the urinary tract by extracorporeal shock wave lithotripsy or endocorporeal lithotripsy, and that quantification of the components in collected stone fragments may not be as relevant as in the case of spontaneously passed calculi or stones surgically removed. Despite the changes in urological practice, procedures for quantifying the proportions of stone components are needed in laboratories which routinely analyse urinary calculi.

Three problems must be considered: first, the large variability in stone composition, because most samples contain three components or more; second, the large scale of the relative proportions of the components in the mixtures; third, the influence of components which are not included initially in the models developed for automated analysis, these components being capable of altering the accuracy of the model.

Underlining the weakness of various factorial methods based on partial least squares or principal component regression, the author has made the choice to develop an analytical procedure based on artificial neural networks (ANN) because of their better capacity to predict stone composition following an appropriate training step. The spectra were prepared from synthetic compounds, which may constitute a weakness of the study, because both whewellite and especially carbapatite from calculi have infrared spectra with significant differences when compared to synthetic samples. Nevertheless, that procedure is a convenient one for creating a large number of synthetic mixtures which may be difficult to obtain from natural calculi. All spectra were normalised to unit area and the spectral data were stored in a data matrix.

Three sets of samples were used: a training set, a validation set and a test set. As expected, the ANN provided better results than the other procedures, but did not provide an accurate classification of all the samples from the test set. The author then used genetic algorithms to improve the results. This is an important step of the study, because genetic algorithms allow the selection of the minimal criteria required for improving the fitness of the system and the network architecture, in order to accurately classify the unknown samples tested following the training procedures. As expected, the results were more accurate for mixtures of calcium oxalates and uric acid than for samples including mixtures of whewellite, weddellite and carbapatite. However, the level of correct classification may be considered very high in comparison with previously reported results, including classifications based on ANN.

In order to improve his results, Igor Kuzmanovski decided to use complementary procedures based on supervised self-organizing maps and on support vector machines, which were more recently proposed to optimize the results of ANN classification methods. This is an original point of his work. Based on selected chromosomes containing the useful spectral information, the author pertinently improved the training step of the ANN procedure. The time required for each optimization procedure using support vector machines (SVM) was significantly shortened in comparison with the other procedures. A minimal number of hidden neurons and output neurons needed to accurately identify unknown samples was established. The SVM procedure has significant advantages over classical ANN because only one solution is given, a smaller subset of training samples is needed to build the solution, and the procedure has higher generalization abilities.

By developing a sequence of data optimization including ANN, SVM and genetic algorithm procedures, the author could significantly improve the classification of natural human urinary calculi composed of whewellite, weddellite and carbapatite. This is a very important result which illustrates the high quality of the thesis. However, it should be kept in mind that more than 25% of stones are composed of complex mixtures including four or more components (up to nine components). Thus, it could be of interest to discuss the possible disturbances (perhaps only minor ones?) in the model due to components which have not been included in the training procedures of the ANN.

In conclusion, I think this work is an important contribution to stone analysis, providing an accurate tool for quantifying stone components in a great majority of mixed urinary calculi. It could be useful to correlate both the qualitative and the quantitative stone composition with the etiological factors involved in stone formation. The procedure developed in this thesis opens new pathways for the automated reading of infrared spectra recorded from urinary stones analysed by Fourier transform infrared spectroscopy.

Paris, June 12, 2005

Michel Daudon, PhD
Chief of the Laboratoire CRISTAL


More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

Classification of Ordinal Data Using Neural Networks

Classification of Ordinal Data Using Neural Networks Classification of Ordinal Data Using Neural Networks Joaquim Pinto da Costa and Jaime S. Cardoso 2 Faculdade Ciências Universidade Porto, Porto, Portugal jpcosta@fc.up.pt 2 Faculdade Engenharia Universidade

More information

CSC 4510 Machine Learning

CSC 4510 Machine Learning 10: Gene(c Algorithms CSC 4510 Machine Learning Dr. Mary Angela Papalaskari Department of CompuBng Sciences Villanova University Course website: www.csc.villanova.edu/~map/4510/ Slides of this presenta(on

More information

Artificial Neural Networks. Edward Gatt

Artificial Neural Networks. Edward Gatt Artificial Neural Networks Edward Gatt What are Neural Networks? Models of the brain and nervous system Highly parallel Process information much more like the brain than a serial computer Learning Very

More information

Support Vector Machines.

Support Vector Machines. Support Vector Machines www.cs.wisc.edu/~dpage 1 Goals for the lecture you should understand the following concepts the margin slack variables the linear support vector machine nonlinear SVMs the kernel

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Support Vector Machines Explained

Support Vector Machines Explained December 23, 2008 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2016 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

Received 21 April 2005; accepted 10 May 2005 Available online 20 July 2005

Received 21 April 2005; accepted 10 May 2005 Available online 20 July 2005 Journal of Molecular Structure 752 (2005) 60 67 www.elsevier.com/locate/molstruc Infrared and Raman spectra of magnesium ammonium phosphate hexahydrate (struvite) and its isomorphous analogues. III. Spectra

More information

Support Vector Machine (continued)

Support Vector Machine (continued) Support Vector Machine continued) Overlapping class distribution: In practice the class-conditional distributions may overlap, so that the training data points are no longer linearly separable. We need

More information

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017 The Kernel Trick, Gram Matrices, and Feature Extraction CS6787 Lecture 4 Fall 2017 Momentum for Principle Component Analysis CS6787 Lecture 3.1 Fall 2017 Principle Component Analysis Setting: find the

More information

Artificial Neural Networks. MGS Lecture 2

Artificial Neural Networks. MGS Lecture 2 Artificial Neural Networks MGS 2018 - Lecture 2 OVERVIEW Biological Neural Networks Cell Topology: Input, Output, and Hidden Layers Functional description Cost functions Training ANNs Back-Propagation

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Christian Mohr

Christian Mohr Christian Mohr 20.12.2011 Recurrent Networks Networks in which units may have connections to units in the same or preceding layers Also connections to the unit itself possible Already covered: Hopfield

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

18.9 SUPPORT VECTOR MACHINES

18.9 SUPPORT VECTOR MACHINES 744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information

Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines

Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2018 CS 551, Fall

More information

Genetic Algorithms: Basic Principles and Applications

Genetic Algorithms: Basic Principles and Applications Genetic Algorithms: Basic Principles and Applications C. A. MURTHY MACHINE INTELLIGENCE UNIT INDIAN STATISTICAL INSTITUTE 203, B.T.ROAD KOLKATA-700108 e-mail: murthy@isical.ac.in Genetic algorithms (GAs)

More information

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396 Data Mining Linear & nonlinear classifiers Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 31 Table of contents 1 Introduction

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Artificial Intelligence (AI) Common AI Methods. Training. Signals to Perceptrons. Artificial Neural Networks (ANN) Artificial Intelligence

Artificial Intelligence (AI) Common AI Methods. Training. Signals to Perceptrons. Artificial Neural Networks (ANN) Artificial Intelligence Artificial Intelligence (AI) Artificial Intelligence AI is an attempt to reproduce intelligent reasoning using machines * * H. M. Cartwright, Applications of Artificial Intelligence in Chemistry, 1993,

More information

SIMULTANEOUS SPECTROPHOTOMETRIC DETERMINATION OF Pb(II) AND Cd(II) USING ARTIFICIAL NEURAL NETWORKS

SIMULTANEOUS SPECTROPHOTOMETRIC DETERMINATION OF Pb(II) AND Cd(II) USING ARTIFICIAL NEURAL NETWORKS Journal of Physical Science, Vol. 18(1), 1 10, 2007 1 SIMULTANEOUS SPECTROPHOTOMETRIC DETERMINATION OF Pb(II) AND Cd(II) USING ARTIFICIAL NEURAL NETWORKS Azizul Isha 1, Nor Azah Yusof 1 *, Mazura Abdul

More information

CS 6375 Machine Learning

CS 6375 Machine Learning CS 6375 Machine Learning Nicholas Ruozzi University of Texas at Dallas Slides adapted from David Sontag and Vibhav Gogate Course Info. Instructor: Nicholas Ruozzi Office: ECSS 3.409 Office hours: Tues.

More information

Neural Networks and the Back-propagation Algorithm

Neural Networks and the Back-propagation Algorithm Neural Networks and the Back-propagation Algorithm Francisco S. Melo In these notes, we provide a brief overview of the main concepts concerning neural networks and the back-propagation algorithm. We closely

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Integer weight training by differential evolution algorithms

Integer weight training by differential evolution algorithms Integer weight training by differential evolution algorithms V.P. Plagianakos, D.G. Sotiropoulos, and M.N. Vrahatis University of Patras, Department of Mathematics, GR-265 00, Patras, Greece. e-mail: vpp

More information

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015 Machine Learning 10-701, Fall 2015 VC Dimension and Model Complexity Eric Xing Lecture 16, November 3, 2015 Reading: Chap. 7 T.M book, and outline material Eric Xing @ CMU, 2006-2015 1 Last time: PAC and

More information

Artificial Neural Networks

Artificial Neural Networks Artificial Neural Networks Stephan Dreiseitl University of Applied Sciences Upper Austria at Hagenberg Harvard-MIT Division of Health Sciences and Technology HST.951J: Medical Decision Support Knowledge

More information

Support Vector Machine. Industrial AI Lab. Prof. Seungchul Lee

Support Vector Machine. Industrial AI Lab. Prof. Seungchul Lee Support Vector Machine Industrial AI Lab. Prof. Seungchul Lee Classification (Linear) Autonomously figure out which category (or class) an unknown item should be categorized into Number of categories /

More information

Introduction to Machine Learning

Introduction to Machine Learning 1, DATA11002 Introduction to Machine Learning Lecturer: Teemu Roos TAs: Ville Hyvönen and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer

More information

Neural Networks and Ensemble Methods for Classification

Neural Networks and Ensemble Methods for Classification Neural Networks and Ensemble Methods for Classification NEURAL NETWORKS 2 Neural Networks A neural network is a set of connected input/output units (neurons) where each connection has a weight associated

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

LEARNING & LINEAR CLASSIFIERS

LEARNING & LINEAR CLASSIFIERS LEARNING & LINEAR CLASSIFIERS 1/26 J. Matas Czech Technical University, Faculty of Electrical Engineering Department of Cybernetics, Center for Machine Perception 121 35 Praha 2, Karlovo nám. 13, Czech

More information

Structure Design of Neural Networks Using Genetic Algorithms

Structure Design of Neural Networks Using Genetic Algorithms Structure Design of Neural Networks Using Genetic Algorithms Satoshi Mizuta Takashi Sato Demelo Lao Masami Ikeda Toshio Shimizu Department of Electronic and Information System Engineering, Faculty of Science

More information

POWER SYSTEM DYNAMIC SECURITY ASSESSMENT CLASSICAL TO MODERN APPROACH

POWER SYSTEM DYNAMIC SECURITY ASSESSMENT CLASSICAL TO MODERN APPROACH Abstract POWER SYSTEM DYNAMIC SECURITY ASSESSMENT CLASSICAL TO MODERN APPROACH A.H.M.A.Rahim S.K.Chakravarthy Department of Electrical Engineering K.F. University of Petroleum and Minerals Dhahran. Dynamic

More information

COMBINATION OF TAGUCHI METHOD AND ARTIFICIAL INTELLIGENCE TECHNIQUES FOR THE OPTIMAL DESIGN OF FLAT-PLATE COLLECTORS

COMBINATION OF TAGUCHI METHOD AND ARTIFICIAL INTELLIGENCE TECHNIQUES FOR THE OPTIMAL DESIGN OF FLAT-PLATE COLLECTORS COMBINATION OF TAGUCHI METHOD AND ARTIFICIAL INTELLIGENCE TECHNIQUES FOR THE OPTIMAL DESIGN OF FLAT-PLATE COLLECTORS Soteris A. Kalogirou Cyprus University of Technology, P. O. Box 509, 60, Limassol, Cyprus

More information

Midterm exam CS 189/289, Fall 2015

Midterm exam CS 189/289, Fall 2015 Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points

More information

Neural network modelling of reinforced concrete beam shear capacity

Neural network modelling of reinforced concrete beam shear capacity icccbe 2010 Nottingham University Press Proceedings of the International Conference on Computing in Civil and Building Engineering W Tizani (Editor) Neural network modelling of reinforced concrete beam

More information

Introduction to Artificial Neural Networks

Introduction to Artificial Neural Networks Facultés Universitaires Notre-Dame de la Paix 27 March 2007 Outline 1 Introduction 2 Fundamentals Biological neuron Artificial neuron Artificial Neural Network Outline 3 Single-layer ANN Perceptron Adaline

More information

Are Rosenblatt multilayer perceptrons more powerfull than sigmoidal multilayer perceptrons? From a counter example to a general result

Are Rosenblatt multilayer perceptrons more powerfull than sigmoidal multilayer perceptrons? From a counter example to a general result Are Rosenblatt multilayer perceptrons more powerfull than sigmoidal multilayer perceptrons? From a counter example to a general result J. Barahona da Fonseca Department of Electrical Engineering, Faculty

More information

Artificial Neural Network

Artificial Neural Network Artificial Neural Network Eung Je Woo Department of Biomedical Engineering Impedance Imaging Research Center (IIRC) Kyung Hee University Korea ejwoo@khu.ac.kr Neuron and Neuron Model McCulloch and Pitts

More information

Pattern Recognition 2018 Support Vector Machines

Pattern Recognition 2018 Support Vector Machines Pattern Recognition 2018 Support Vector Machines Ad Feelders Universiteit Utrecht Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 1 / 48 Support Vector Machines Ad Feelders ( Universiteit Utrecht

More information

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering

More information

Artificial Neural Networks (ANN) Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso

Artificial Neural Networks (ANN) Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso Artificial Neural Networks (ANN) Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso xsu@utep.edu Fall, 2018 Outline Introduction A Brief History ANN Architecture Terminology

More information

Artificial Neural Network Approach for Land Cover Classification of Fused Hyperspectral and Lidar Data

Artificial Neural Network Approach for Land Cover Classification of Fused Hyperspectral and Lidar Data Artificial Neural Network Approach for Land Cover Classification of Fused Hyperspectral and Lidar Data Paris Giampouras 1,2, Eleni Charou 1, and Anastasios Kesidis 3 1 Computational Intelligence Laboratory,

More information

A Novel Rejection Measurement in Handwritten Numeral Recognition Based on Linear Discriminant Analysis

A Novel Rejection Measurement in Handwritten Numeral Recognition Based on Linear Discriminant Analysis 009 0th International Conference on Document Analysis and Recognition A Novel Rejection easurement in Handwritten Numeral Recognition Based on Linear Discriminant Analysis Chun Lei He Louisa Lam Ching

More information

Lecture 6. Notes on Linear Algebra. Perceptron

Lecture 6. Notes on Linear Algebra. Perceptron Lecture 6. Notes on Linear Algebra. Perceptron COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Andrey Kan Copyright: University of Melbourne This lecture Notes on linear algebra Vectors

More information

ARTIFICIAL NEURAL NETWORK WITH HYBRID TAGUCHI-GENETIC ALGORITHM FOR NONLINEAR MIMO MODEL OF MACHINING PROCESSES

ARTIFICIAL NEURAL NETWORK WITH HYBRID TAGUCHI-GENETIC ALGORITHM FOR NONLINEAR MIMO MODEL OF MACHINING PROCESSES International Journal of Innovative Computing, Information and Control ICIC International c 2013 ISSN 1349-4198 Volume 9, Number 4, April 2013 pp. 1455 1475 ARTIFICIAL NEURAL NETWORK WITH HYBRID TAGUCHI-GENETIC

More information

Holdout and Cross-Validation Methods Overfitting Avoidance

Holdout and Cross-Validation Methods Overfitting Avoidance Holdout and Cross-Validation Methods Overfitting Avoidance Decision Trees Reduce error pruning Cost-complexity pruning Neural Networks Early stopping Adjusting Regularizers via Cross-Validation Nearest

More information

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others) Machine Learning Neural Networks (slides from Domingos, Pardo, others) For this week, Reading Chapter 4: Neural Networks (Mitchell, 1997) See Canvas For subsequent weeks: Scaling Learning Algorithms toward

More information

6.036 midterm review. Wednesday, March 18, 15

6.036 midterm review. Wednesday, March 18, 15 6.036 midterm review 1 Topics covered supervised learning labels available unsupervised learning no labels available semi-supervised learning some labels available - what algorithms have you learned that

More information

Introduction to Support Vector Machines

Introduction to Support Vector Machines Introduction to Support Vector Machines Andreas Maletti Technische Universität Dresden Fakultät Informatik June 15, 2006 1 The Problem 2 The Basics 3 The Proposed Solution Learning by Machines Learning

More information

CHAPTER 3 FEATURE EXTRACTION USING GENETIC ALGORITHM BASED PRINCIPAL COMPONENT ANALYSIS

CHAPTER 3 FEATURE EXTRACTION USING GENETIC ALGORITHM BASED PRINCIPAL COMPONENT ANALYSIS 46 CHAPTER 3 FEATURE EXTRACTION USING GENETIC ALGORITHM BASED PRINCIPAL COMPONENT ANALYSIS 3.1 INTRODUCTION Cardiac beat classification is a key process in the detection of myocardial ischemic episodes

More information

The Decision List Machine

The Decision List Machine The Decision List Machine Marina Sokolova SITE, University of Ottawa Ottawa, Ont. Canada,K1N-6N5 sokolova@site.uottawa.ca Nathalie Japkowicz SITE, University of Ottawa Ottawa, Ont. Canada,K1N-6N5 nat@site.uottawa.ca

More information

Distributed Genetic Algorithm for feature selection in Gaia RVS spectra. Application to ANN parameterization

Distributed Genetic Algorithm for feature selection in Gaia RVS spectra. Application to ANN parameterization Distributed Genetic Algorithm for feature selection in Gaia RVS spectra. Application to ANN parameterization Diego Fustes, Diego Ordóñez, Carlos Dafonte, Minia Manteiga and Bernardino Arcay Abstract This

More information

Microarray Data Analysis: Discovery

Microarray Data Analysis: Discovery Microarray Data Analysis: Discovery Lecture 5 Classification Classification vs. Clustering Classification: Goal: Placing objects (e.g. genes) into meaningful classes Supervised Clustering: Goal: Discover

More information

Anti correlation: A Diversity Promoting Mechanisms in Ensemble Learning

Anti correlation: A Diversity Promoting Mechanisms in Ensemble Learning Anti correlation: A Diversity Promoting Mechanisms in Ensemble Learning R. I. (Bob) McKay 1 and Hussein A. Abbass 2 School of Computer Science, University of New South Wales, Australian Defence Force Academy

More information

SUPPORT VECTOR MACHINE

SUPPORT VECTOR MACHINE SUPPORT VECTOR MACHINE Mainly based on https://nlp.stanford.edu/ir-book/pdf/15svm.pdf 1 Overview SVM is a huge topic Integration of MMDS, IIR, and Andrew Moore s slides here Our foci: Geometric intuition

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

Reading, UK 1 2 Abstract

Reading, UK 1 2 Abstract , pp.45-54 http://dx.doi.org/10.14257/ijseia.2013.7.5.05 A Case Study on the Application of Computational Intelligence to Identifying Relationships between Land use Characteristics and Damages caused by

More information

Artificial Neural Networks Examination, March 2004

Artificial Neural Networks Examination, March 2004 Artificial Neural Networks Examination, March 2004 Instructions There are SIXTY questions (worth up to 60 marks). The exam mark (maximum 60) will be added to the mark obtained in the laborations (maximum

More information

Machine Learning & SVM

Machine Learning & SVM Machine Learning & SVM Shannon "Information is any difference that makes a difference. Bateman " It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible

More information

A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie

A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie Computational Biology Program Memorial Sloan-Kettering Cancer Center http://cbio.mskcc.org/leslielab

More information

Online Videos FERPA. Sign waiver or sit on the sides or in the back. Off camera question time before and after lecture. Questions?

Online Videos FERPA. Sign waiver or sit on the sides or in the back. Off camera question time before and after lecture. Questions? Online Videos FERPA Sign waiver or sit on the sides or in the back Off camera question time before and after lecture Questions? Lecture 1, Slide 1 CS224d Deep NLP Lecture 4: Word Window Classification

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

On Improving the k-means Algorithm to Classify Unclassified Patterns

On Improving the k-means Algorithm to Classify Unclassified Patterns On Improving the k-means Algorithm to Classify Unclassified Patterns Mohamed M. Rizk 1, Safar Mohamed Safar Alghamdi 2 1 Mathematics & Statistics Department, Faculty of Science, Taif University, Taif,

More information

Small sample size generalization

Small sample size generalization 9th Scandinavian Conference on Image Analysis, June 6-9, 1995, Uppsala, Sweden, Preprint Small sample size generalization Robert P.W. Duin Pattern Recognition Group, Faculty of Applied Physics Delft University

More information