
UNIVERSITY OF MARIBOR
FACULTY OF CHEMISTRY AND CHEMICAL ENGINEERING

DOCTORAL DISSERTATION
CHEMOMETRIC CHARACTERIZATION OF ENVIRONMENTAL AND FOOD SAMPLES

DOKTORSKA DISERTACIJA
KEMOMETRIJSKA KARAKTERIZACIJA OKOLJSKIH IN PREHRANSKIH VZORCEV

KATJA ŠNUDERL

MARIBOR, January 2009

UNIVERSITY OF MARIBOR
FACULTY OF CHEMISTRY AND CHEMICAL ENGINEERING

KATJA ŠNUDERL

CHEMOMETRIC CHARACTERIZATION OF ENVIRONMENTAL AND FOOD SAMPLES
DOCTORAL DISSERTATION

KEMOMETRIJSKA KARAKTERIZACIJA OKOLJSKIH IN PREHRANSKIH VZORCEV
DOKTORSKA DISERTACIJA

Maribor, January 2009

Doctoral dissertation: CHEMOMETRIC CHARACTERIZATION OF ENVIRONMENTAL AND FOOD SAMPLES
Student: Katja ŠNUDERL
Education program: Chemical Engineering
Adviser: Assoc. Prof. Dr. Darinka BRODNJAK-VONČINA
Co-adviser: Full Prof. Dr. Jan MOCAK

DECLARATION

Herewith, I declare that this Thesis is the result of my own research work. Contributions of others are marked. The literature was explored using the following elements:
Sources: ProQuest Digital Dissertations, Science Direct, Compendex & Inspec Database and ISI Web of Science
Keywords: chemometric classification, chemometric characterization, principal component analysis, discriminant analysis, cluster analysis, canonical correlations, neural networks, white wine, air samples, oncology data, morphine, mineral water, pumpkin oil, food samples, environmental samples
Time period: up to 2008
No. of references: 92
No. of read abstracts: 362
No. of read articles: 134
No. of reviewed books: 12

Maribor, January 2009
Signature of the Student: Katja ŠNUDERL


To explain all nature is too difficult a task for any one man or even for any one age. 'Tis much better to do a little with certainty, and leave the rest for others that come after you, than to explain all things.

Sir Isaac Newton (1642-1727), English mathematician, physicist and astronomer

Acknowledgements

I would like to acknowledge my supervisors, Full Prof. Dr. Jan Mocak and Assoc. Prof. Dr. Darinka Brodnjak-Vončina, for their undying support, both academic and moral. Their love of and devotion to research is truly inspiring, and I will always be grateful to them for believing in my potential and granting me the opportunity to work and gain experience in their labs. I would also like to acknowledge my family, friends and working colleagues for their immense support and helpful suggestions on several presentations and paper reviews. This research was supported by the Slovenian Research Agency, which financed my work as a young researcher at the Faculty of Chemistry and Chemical Engineering, by a grant from Project No. 37s15 of Aktion Oesterreich-Slowakei during the years 2002 and 2003, and by the Slovak Grant Agency VEGA (Projects VEGA 1/9129/02, VEGA 1/2464/05 and VEGA 1/3584/06) and APVV.

Key Word Documentation

DN: Dd
DC: UDC 542:311 (043.3)
CX: chemometric classification, chemometric characterization, cluster analysis, principal component analysis, canonical correlations, discriminant analysis, neural networks, white wine, air samples, oncology data, morphine, mineral water, pumpkin oil, food samples, environmental samples
CC:
AU: ŠNUDERL, Katja
AA: BRODNJAK-VONČINA, Darinka (advisor); MOCAK, Jan (co-advisor)
PP: SI-2000 Maribor, Smetanova 17
PB: University of Maribor, Faculty of Chemistry and Chemical Engineering
PY: 2008
TI: CHEMOMETRIC CHARACTERIZATION OF ENVIRONMENTAL AND FOOD SAMPLES
DT: Doctoral dissertation
NO: 125 pages, 30 figures, 15 tables, 92 references
LA: en
AL: en/sl

Extended abstract

The gathered data were processed using the following chemometric methods: statistics (description and prediction of the properties of samples, measurements and populations), visualisation (representation and projection of complex multidimensional data into a 2-D space), partition (selection and definition of training, test and control sample sets), classification (prediction of the category to which a sample belongs), modelling (quantitative prediction of object properties), and optimisation (selection and determination of the most appropriate conditions, variables and properties under given boundary requirements).

The Thesis covers several fields of research: (1) research on food products, (2) research in environmental chemistry, (3) research in clinical chemistry considering oncology patients undergoing different laboratory tests, and (4) the contemporary use of chemometrics, statistics and informatics. Using chemometric techniques, the following tasks were performed:
(a) characterisation and classification of pumpkin seed oils with regard to their chemical and sensorial properties;
(b) multivariate analysis of wine samples concerning classification and determination of the origin and vintage of different wine sorts;
(c) characterisation and classification of mineral waters with regard to their physico-chemical properties;
(d) assessment of the content of pollutants monitored in the collected air samples, in order to find the correlations among the concentrations of the selected chemical components in air as well as selected meteorological parameters;
(e) evaluation of the correlations among the concentrations of morphine and its metabolites in the serum of oncology patients, aimed at finding their dependence on daily drug doses, sort of drug, personal characteristics and gender of the patient, type of tumour, and traditional biochemical parameters.

Due to its systematic approach, the dissertation provides new insight into the practical field of categorisation and classification of larger data sets in different analytical areas. Using multivariate chemometric methods, the following original scientific contributions were achieved:
- Determination of the quality of the selected pumpkin seed oils using measured UV-Vis, NIR and FTIR spectra at different wavelengths and the performed sensorial data analysis (evaluation with regard to taste, odour and colour).
- Construction of classification models of wine samples and determination of their authenticity with regard to different criteria (vintage, producer, and sort of wine), connected to the sensorial evaluation of the wine samples (evaluation with regard to taste, colour and bouquet).
- Classification, chemometric characterisation and comparison of 50 different mineral water samples (37 from Slovenia and 13 from other European countries) with regard to their physical parameters and ion contents.
- Elucidation of correlations between the concentrations of morphine and its metabolites in blood serum taken from oncology patients under treatment; determination of the dependence of the above-mentioned concentrations upon the administered drug (daily dose, sort of medicament), the patient's personal characteristics and gender, tumour location, and measured biochemical parameters.
- Construction of mathematical models and a graphical review of the correlations among the air pollutants and the meteorological conditions valid at the time of sampling.

In the Thesis the use of the following multivariate methods of data analysis is described: correlation analysis, analysis of variance (ANOVA), cluster analysis (CA), principal component analysis (PCA), canonical correlation analysis (CCA), linear discriminant analysis (LDA) and logistic regression (LR). These chemometric methods were performed with the use of different software packages such as TeachMe, Statgraphics Plus, Systat, SAS, MS Excel, and others.

UDC: 542:311 (043.3)
Keywords: chemometric classification, chemometric characterization, cluster analysis, principal component analysis, canonical correlations, discriminant analysis, neural networks, white wine, air samples, oncology data, morphine, mineral water, pumpkin oil, food samples, environmental samples.

Key Documentation Information

ŠD: Dd
DK: UDK 542:311 (043.3)
KS: chemometric classification, chemometric characterization, cluster analysis, principal component analysis, canonical correlation analysis, discriminant analysis, neural networks, white wines, air samples, oncological data, morphine, mineral waters, pumpkin seed oils, food samples, environmental samples
KK:
AV: ŠNUDERL, Katja
SA: BRODNJAK-VONČINA, Darinka (adviser); MOCAK, Jan (co-adviser)
KZ: SI-2000 Maribor, Smetanova 17
ZA: University of Maribor, Faculty of Chemistry and Chemical Engineering
LI: 2008
IN: KEMOMETRIJSKA KARAKTERIZACIJA OKOLJSKIH IN PREHRANSKIH VZORCEV
TD: Doctoral dissertation
OP: 125 pages, 30 figures, 15 tables, 92 references
IJ: en
JI: en/sl

Extended abstract

We processed the chemical data with the following chemometric methods: statistics (description and prediction of the properties of samples (measurements) and populations), visualisation (overview and presentation of complex multidimensional data in a 2-D space), partition (selection and definition of training, test and control data sets), classification (prediction of the category to which unknown samples belong), modelling (quantitative prediction of the properties of objects), and optimisation (selection and determination of the most favourable conditions, variables and properties under the given boundary requirements).

The work covers several fields: (1) research on food products, (2) research on environmental measurements, (3) research on oncological laboratory tests and studies, and (4) the modern use of chemometrics, statistics and informatics. We combined the results of measurements of various parameters, obtained with instrumental analytical techniques, with chemometric techniques and the available software. Within this framework we carried out (a) the characterisation and classification of pumpkin seed oils with respect to their chemical and sensory properties, (b) the multivariate analysis of wines aimed at the classification and proof of authenticity of wine samples with respect to their origin, vintage or producer, and (c) the characterisation and classification of mineral waters with respect to their chemical and physical properties. Using chemometric techniques, we processed the collected results of measurements of pollutant contents in air samples, drawing on monitoring results. One of the aims of the research was to find the relationship between the concentrations of selected chemical components in the air and meteorological parameters. With multivariate methods we processed the results of analytical and biochemical tests as well as the personal characteristics of oncological patients who were receiving morphine-containing analgesics. In order to better understand and gain the possibility of predicting patients' resistance to pain, we used chemometric tools to elucidate the relationships between the concentrations of morphine and its metabolites in blood serum, and their dependence on the details of the patients' treatment (daily dose of the drug, kind of

drug), personal characteristics and gender, type of tumour, and the usual measured biochemical parameters.

Owing to its systematic treatment, the doctoral dissertation presents new findings in the practical field of the categorisation and classification of larger data sets in various areas. With the multivariate study we made contributions to science in the following directions:
- determination of the quality of a selected kind of pumpkin seed oil on the basis of measured UV-Vis, NIR and FTIR spectra at different wavelengths and the performed sensory analysis of the samples (assessment with regard to taste, odour and colour);
- construction of models for the classification of white wines and the determination of their authenticity according to different criteria (sort of wine, vintage, producer), in connection with the scores given to the wines by oenologists (assessment of the colour, taste and bouquet of the wine);
- classification, chemometric characterisation and comparison of samples of different mineral waters (from Slovenia and other European countries) with respect to physical parameters and ion contents;
- presentation of the relationships between the concentrations of morphine and its metabolites in blood serum during the treatment of oncological patients, and determination of the dependence of these concentrations on the prescribed treatment (daily dose, kind of drug), the patient's personal characteristics, gender, tumour location and the measured biochemical parameters;
- construction of mathematical models and a graphical presentation of the relationships between the measured contents of air pollutants and the simultaneously recorded atmospheric conditions.

In the Thesis I have described various methods of multivariate data analysis: correlation analysis, analysis of variance (ANOVA, MANOVA), cluster analysis, principal component analysis, canonical correlation analysis, linear discriminant analysis and logistic regression. These chemometric methods were carried out with the help of various software tools such as TeachMe, Statgraphics Plus, Systat, SAS, MS Excel and others.

UDK: 542:311 (043.3)
Keywords: chemometric classification, chemometric characterization, cluster analysis, principal component analysis, canonical correlation analysis, discriminant analysis, neural networks, white wines, air samples, oncological data, morphine, mineral waters, pumpkin seed oils, food samples, environmental samples.

Contents

Acknowledgements
Key Word Documentation
Extended Abstract
Ključna dokumentacijska informacija
Razširjeni povzetek
Contents
List of Figures
List of Tables

Chapter I »Introduction«

Chapter II »Overview of some multivariate statistical methods«
    Introduction to multivariate statistical methods
    Advantages and disadvantages of multivariate data analyses
    Description and pre-treatment of data
    Similarity and dissimilarity of objects
    Basic and multivariate statistical methods
        Correlation Analysis
        (M)ANOVA
        Cluster Analysis
        Principal Component Analysis
        Canonical Correlations
        Discriminant Analysis
        Advanced regression analysis, logistic regression

Chapter III »Aims«

Chapter IV »Chemometric characterisation of pumpkin seed oil samples«
    Abstract
    Introduction
    Methods
        Analysed samples
        Instrumental measurements
        Chemometric methodology
    Results and discussion
        Pretreatment of data, feature reduction
        Spectra of accepted and rejected quality pumpkin seed oils
        Chemometrical classification of pumpkin seed oils using two and three categories
    Conclusion

Chapter V »Chemometric characterisation of Slovak white varietal wines«
    Abstract
    Introduction
    Methods
        Wine samples
        Sensorial Analysis
        Statistical Analysis
    Results and discussion
        Principal Component Analysis
        Discriminant Analysis
        ANalysis Of VAriance
    Conclusion

Chapter VI »Chemometric characterisation of mineral water samples«
    Abstract
    Introduction
    Methods
    Results and discussion
        Statistical Screening of Data
        Cluster Analysis
        Principal Component Analysis
        Linear Discriminant Analysis
    Conclusion

Chapter VII »Chemometric characterisation of air data samples«
    Abstract
    Introduction
    Methods
    Results and discussion
        Correlation Analysis
        Principal Component Analysis
        Canonical Correlations
        Cluster Analysis
    Conclusion

Chapter VIII »Chemometric characterisation of oncological analytical data«
    Abstract
    Introduction
    Methods
    Results and discussion
        Correlation Analysis
        Principal Component Analysis
        Cluster Analysis
        Canonical Correlation Analysis
        Logistic Regression
    Conclusion

Chapter IX »Outputs and final conclusions«
    Outputs of this PhD Thesis
    Final conclusions

Chapter X »References«

Chapter XI »Publications connected to the doctoral Dissertation theme«

Curriculum vitae

List of Figures

Figure 1-1: How chemometrics relates to other scientific disciplines
Figure 2-1: Four sets of data with the same correlation coefficient of 0.816
Figure 2-2: The logistic function, with z on the horizontal axis and f(z) on the vertical axis
Figure 4-1: A fragment of the correlation table illustrating the correlation coefficient values characteristic for correlations among wavenumbers in FTIR spectra of pumpkin oil samples
Figure 4-2: UV-Vis spectra of two representative pumpkin seed oils: P126, representing acceptable quality oils, and P58, representing not acceptable quality oils
Figure 4-3: FTIR spectra of two representatives of acceptable and of rejected quality pumpkin seed oils, P21 and P158, and P48-2 and P119, respectively. Signal magnitude in the centre of the marked region is in the following order: P21 > P158 > P48-2 > P119
Figure 4-4: A 100% successful classification of 70 pumpkin seed oil samples categorised into two classes, acceptable (good) and not acceptable (bad) quality oils, using UV-Vis absorbance signals measured at 52 wavelengths optimally chosen by the backward-selection technique. The ordinal number of the pumpkin seed oil sample is plotted vs. discriminant function 1 (DF1)
Figure 4-5: A 100% successful classification of 80 pumpkin seed oil samples categorised into two classes using near-IR absorption signals measured at 62 wavelengths optimally chosen by the backward-selection technique. The ordinal number of the pumpkin seed oil sample is plotted vs. DF1
Figure 4-6: A 98.78% successful classification of 82 pumpkin seed oil samples categorised into two classes using FTIR absorption signals measured at 59 wavelengths optimally chosen by the backward-selection technique. The ordinal number of the pumpkin seed oil sample is plotted vs. DF1. Sample number 59 was misclassified
Figure 4-7: A 100% successful classification of 70 pumpkin seed oil samples categorised into three classes according to sensory quality: highest score ("excellent"), good score ("satisfactory") and not acceptable ("bad"), using UV-Vis absorbance signals measured at 57 wavelengths optimally chosen by the backward-selection technique. DF2 vs. DF1 is plotted. Sample number 12 is located at the class border
Figure 4-8: A 100% successful classification of 80 pumpkin seed oil samples categorised into three classes using near-IR absorption signals measured at 61 wavelengths optimally chosen by the backward-selection technique. DF2 vs. DF1 is plotted. Sample number 50 is located at the class border
Figure 4-9: A 100% successful classification of 82 pumpkin seed oil samples categorised into three classes using FTIR absorption signals measured at 62 wavelengths optimally chosen by the backward-selection technique. DF2 vs. DF1 is plotted. Sample number 73 is located at the class border
Figure 5-1: The PCA dependence PC2 vs. PC1. Three wine varieties are marked by different symbols. The left cluster of points belongs to vintage 1999, the right one to the other studied vintage
Figure 5-2: The PCA dependence PC2 vs. PC1 for two groups of wine samples created by sensorial assessment (total points) and designated by different symbols. The good wines are located mostly at lower PC2 values. The left cluster belongs to vintage 1999, the right one to the other studied vintage
Figure 5-3: The plot of the discriminant functions DF2 vs. DF1 exhibiting the classification by wine variety, indicated by the symbols used. A 100% classification success of 46 wine samples was found
Figure 5-4: The ordinal sample number N vs. DF1 dependence shows the values of the first (and only) discriminant function DF1 for various N. The samples were renumbered in order to better recognize the two classes of wines differing by the year of production. A 100% classification success for 46 wine samples was achieved by using only the two best variables: Ethanol (v13) and pH (v17), or Density (v12) and Total extract (v14)
Figure 5-5: The sample number N vs. DF1 dependence exhibiting the values of the first (and only) discriminant function DF1 for various N. The samples were renumbered in order to better recognize two classes of wines differing by sensorial quality (total points). A 93.5% classification success for 46 wine samples was achieved using only the five best variables: Lactic acid (v8), Glucose (v10), Ethanol (v13), SO2 total (v2), and Total acids (v3)
Figure 6-1: The plot demonstrating a strong correlation between the sodium content and conductivity. The origin of the mineral waters: 1 Slovenia, 2 Czech Republic, 3 Hungary, 4 Germany, 5 the Western Balkan territory
Figure 6-2: Complete linkage clustering (furthest neighbour) of the studied mineral water samples using Euclidean distance metrics
Figure 6-3: PCA loading/loading plot in the PC2 vs. PC1 plane using autoscaled (i.e. standardized) data
Figure 6-4: Linear discriminant analysis of mineral water samples using the plane of discriminant functions DF2 vs. DF1. The selected nine best variables are indicated in the text. The sampling territory is given by the numbers used in the legend to Figure 6-1. The object O (outlier) belongs to class 1 but was classified into class 5. Class centroids are marked by the * symbol
Figure 7-1: Loadings plot showing the interposition of the 12 studied variables (indicated in the text) in the plane of the most important principal components, PC2 vs. PC1; individual day sampling data were used
Figure 7-2: Canonical correlation analysis of two sets of variables created by the pollutants (chemical factors, C1p) and the meteorological factors (C1m). Four different symbols (four colours in the original picture) in the diagram represent the four seasons of the year: March, April, May (spring); June, July, August (summer); September, October, November (autumn); December, January, February (winter)
Figure 7-3: Cluster analysis of 1215 samples corresponding to individual day sampling using 12 variables
Figure 8-1: Principal component analysis biplot PC2 vs. PC1 representing relations between the morphine variables and personal characteristics (including morphine daily dose) of 42 oncological patients (the smaller data set, without the data of patient No. 31)
Figure 8-2: Principal component analysis biplot PC2 vs. PC1 representing relations between the ratio morphine variables and personal characteristics (including morphine daily dose) of 99 oncological patients (the larger data set). PC1 and PC2 gather 54.2% of the information
Figure 8-3: Principal component analysis biplot PC3 vs. PC1 representing relations between the ratio morphine variables and personal characteristics (including morphine daily dose) of 43 oncological patients (the smaller data set). PC1 and PC3 gather 55.3% of the information
Figure 8-4: Dendrogram of Ward's method of hierarchical cluster analysis with Euclidean distances showing the distances among 12 selected variables, which represent personal characteristics of 43 oncological patients (the smaller data set), morphine daily dose, the morphine and corresponding glucuronide serum concentrations and their ratios
Figure 8-5: Dendrogram of Ward's method of hierarchical cluster analysis with Euclidean distances showing the distances among 13 selected variables, which represent personal characteristics of 43 oncological patients, morphine daily dose, the morphine and corresponding glucuronide serum concentrations and their ratios
Figure 8-6: Dendrogram of Ward's method of hierarchical cluster analysis with squared Euclidean distances showing the distances among 23 selected variables, which represent characteristics of 103 oncological patients, morphine daily dose, the morphine and corresponding glucuronide serum concentrations and their ratios

List of Tables

Table 2-1: Some examples of multivariate statistical methods
Table 2-2: Comparison between univariate and multivariate statistical methods
Table 2-3: Interpretation of the size of a correlation
Table 4-1: Success in the LDA classification (training) and the corresponding number of optimally selected variables for six chemometrical models calculated for each kind of spectrum and two and three preselected pumpkin seed oil classes
Table 4-2: Classification of pumpkin oils by sensorial quality defined by two classes using various spectral methods by means of linear discriminant analysis
Table 4-3: Classification of pumpkin oils by sensorial quality defined by three classes using various spectral methods by means of linear discriminant analysis
Table 5-1: Concentration of chemical compounds and further characteristics used as variables in the multivariate data analysis of varietal wines
Table 5-2: Criteria for wine classification and success in the LDA classification when all or the selected best variables were used
Table 5-3: The probability p-values in the ANOVA classification of wines using different classification factors and the eighteen variables (v1 to v18) described in Table 5-1
Table 6-1: Fraction of the total variance in %, contained in the principal components, calculated by column standardization (autoscaling) of the data
Table 6-2: Standardized discriminant function coefficients
Table 7-1: Pearson's correlation coefficients exhibiting the strength of correlation of all pairs of variables, with the p-values below them
Table 8-1: Personal properties, tumour location and kind of treatment of the monitored oncology patients
Table 8-2: Concentrations of morphine, morphine-3-glucuronide and morphine-6-glucuronide in the serum of oncology patients, and their ratios
Table 8-3: Biochemical data on the monitored oncology patients: results of the performed biochemical tests in serum

Chapter I

1. Introduction

The best way to introduce chemometrics is to start with a few definitions. This will be useful for two reasons: to clarify some key terms which occur frequently in the literature as well as in this thesis, and to provide a brief look at several concepts which are frequently used in chemometrics. The term chemometrics was established several decades ago to describe a new way of analyzing chemical data. Chemometrics itself can be defined as a chemical discipline that uses mathematics, statistics, graphical or symbolic methods and formal logic (a) to design or select optimal experimental procedures, (b) to provide the maximum relevant chemical information which can be extracted from chemical data, and (c) to obtain knowledge about chemical systems. The relationship of chemometrics to different scientific disciplines is indicated in Figure 1-1. On the left there are the enabling sciences, mainly mathematical and not laboratory based. Statistical approaches are based on mathematical theory, so statistics falls between mathematics and chemometrics. Computing is important inasmuch as chemometrics relies on software; however, chemometrics is not really a computer science. Engineers, especially chemical and process engineers, often need chemometric methods in many areas of their work.

Figure 1-1: How chemometrics relates to other scientific disciplines.

On the right there are the main disciplines of chemistry that benefit from chemometrics. Analytical chemistry is probably the most significant area, since chemometrics plays an important role there and often originates from analytical chemistry. Environmental chemists, biologists, food chemists as well as geochemists, chemical archaeologists, forensic scientists, and so on, depend on good analytical chemistry measurements; many of them routinely use multivariate approaches, especially for pattern recognition, and need chemometrics to help interpret their data. Organic chemists have different needs for chemometrics, primarily in experimental design (e.g. optimising reaction conditions) and quantitative structure-activity relationships (QSAR) for drug design. Finally, physical

chemists such as spectroscopists, kineticists and materials scientists often come across methods of signal deconvolution and multivariate data analysis.

The term multivariate analysis, as usually applied by chemometricians, denotes any statistical, mathematical or graphical approach which considers multiple variables simultaneously. Chemometric procedures have proved useful at every step of analysis, from the first conception of an experiment until the decision upon the studied problem is made. Chemometrics is used to solve problems involving large amounts of data. Within process analysis and monitoring, chemical analysis, spectroscopy, molecular modelling, sensory analysis and many other fields, large data tables are obtained that need to be analyzed and visualized in order to properly understand the problem at hand. Chemometric tools can be used for a wide variety of tasks, including exploratory data analysis, experimental design, and the development of predictive models. In the context of analytical chemistry, however, chemometrics has been shown to be most effective for two general functions, namely instrument control using computer algorithms (multivariate calibration models are built in order to provide selectivity for a multivariate analytical instrument) and information extraction (chemometric tools are used to unlock hidden information already present in information-rich multivariate data coming from analytical instruments, to enable improved understanding of chemistry and chemical technology).

Chemometrics is involved in the process of producing data and in the extraction of information from these data. If the quality of the measurement processes, and therefore the quality of the data, is not good enough, such information may be uncertain or even wrong. The key to chemometrics is to understand how to perform meaningful calculations on the data. In most cases these calculations are too complex to do by hand or with a calculator, so it is necessary to use a computer program. Data analysis is not really a knowledge-based subject but rather a skill-based one. In chemometrics, although there are quite a number of named methods, the key is not to learn hundreds of equations by heart but to understand a few basic principles. These ideas occur again and again, but in different contexts.

Exploring the field of my work and going through the literature, I found many advantages of chemometrics. Here are just some of them:
i) Chemometrics provides speed in obtaining real-time information from data.
ii) It allows high-quality information to be extracted from unresolved data.
iii) It provides clear information resolution and exhibits a discrimination power when applied to second-, third- and possibly higher-order data.
iv) It provides diagnostics for the probability that the information it derives is accurate.
v) It helps in improving the measurements.
vi) It improves knowledge of the existing processes.
vii) It has low capital requirements: it is cheap.
In summary, it promises faster, cheaper, and better information with known integrity. Intelligence can replace physical and material solutions, much as the digital chip replaced mechanical clockwork. The perceived disadvantage of chemometrics is that there exists widespread ignorance about what it is and what it can realistically accomplish.
Many people talk about chemometrics, but relatively few actually use it for daily activities and major problem solving in industrial situations. This science is considered too complex for the average technician and analyst. The mathematics can be misinterpreted as esoteric and not relevant. Most importantly for industry, there is a lack of official practices and methods associated with chemometrics. Chemometrics requires a change in one's approach to problem solving, from univariate to multivariate thinking, since we live in an essentially multivariate context; from pondering over spreadsheets to actually analyzing the data for its full information content. The new

method looks at all the data from a multivariate approach, whereas the old method relies on the scientist's assumed powers of observation from a univariate standpoint. One of the difficulties is to decide what software to employ in order to analyse the data. Users of chemometric methods should select their own software tools, on the one hand according to their desires and the tasks which are to be served, and on the other hand according to personal experience, costs, and the hardware available. Some chemometricians prefer to program their own methods, while others use various statistical packages.

Chapter II

2. Overview of some multivariate statistical methods

2.1. Introduction to multivariate statistical methods

When the analysis yields a large number of parameters for each object, it is difficult to get an overview of the relations between individual variables. Multivariate statistical methods can be applied in such a situation to give an objective description of the data. Multivariate statistical methods are the data analysis techniques often grouped together under the name "multivariate statistics". The word multivariate means that these techniques look at the pattern of relationships among several variables simultaneously. Multivariate statistics helps the researcher to summarize data and reduce the number of variables necessary to describe them. Multivariate statistics is most commonly employed (a) for developing taxonomies or systems of classification, (b) to investigate useful ways to conceptualize or group items, (c) to generate hypotheses, and (d) to test hypotheses. According to Wikipedia, the free encyclopaedia, the word multivariate is defined as "having or involving a number of independent mathematical or statistical variables". Multivariate statistics or multivariate analysis describes a collection of procedures which involve observation and analysis of more than one variable at a time.

Important methods of multivariate statistics can be divided into four groups (see Table 2-1), namely i) unsupervised learning methods, ii) supervised learning methods, iii) factorial methods, and iv) correlation and regression analysis. Unsupervised learning is a type of machine learning where manual labels of inputs are not used. When using unsupervised learning methods, grouping of analytical data is possible either by means of clustering methods or by projecting the high-dimensional data onto a lower-dimensional space. Since there is no supervisor in the sense of known membership of objects to classes, these methods are performed in an unsupervised manner. Supervised learning is a machine learning technique for learning a function from training data. The training data consist of pairs of input objects (typically vectors) and desired outputs. The output of the function can be a continuous value (called regression) or can predict a class label of the input object (called classification). The task of the supervised learner is to predict the value of the function for any valid input object after having seen a number of training examples (i.e. pairs of input and target output). To achieve this, the learner has to generalize from the presented data to unseen situations in a "reasonable" way. Methods of multivariate analysis of variance (MANOVA) extend analysis of variance (ANOVA) to cover cases where there is more than one dependent variable and where the dependent variables cannot simply be combined. Discriminant function or canonical variate analyses attempt to establish whether a set of variables can be used to distinguish between two or more groups. Linear discriminant analysis (LDA) computes a linear predictor from two sets of normally distributed data to allow for classification of new observations.
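The distinction between the two learning modes can be made concrete in a few lines of code. The following is a minimal sketch, assuming scikit-learn and NumPy are available; the synthetic data are purely illustrative and not taken from the thesis. It clusters the same objects without using their labels, and then classifies them with LDA using labels:

```python
# A minimal sketch contrasting unsupervised and supervised learning,
# assuming scikit-learn and NumPy are available; the synthetic data are
# illustrative only, not taken from the thesis.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic "measurements": 90 objects, 4 variables, 3 underlying classes.
X, y = make_blobs(n_samples=90, n_features=4, centers=3, random_state=0)

# Unsupervised: group the objects without using the class labels at all.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Supervised: train LDA on labelled objects, then classify unseen ones.
lda = LinearDiscriminantAnalysis().fit(X[:60], y[:60])  # training set
accuracy = lda.score(X[60:], y[60:])                    # test set
print(f"LDA test-set classification success: {accuracy:.1%}")
```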

Table 2-1: Some examples of multivariate statistical methods

Method: Unsupervised learning methods
* Cluster Analysis (CA)
* Display methods:
** NonLinear Mapping (NLM)
** Minimal Spanning Tree (MST)
** Principal Component Analysis (PCA)
Problem solved: finding structures/similarities (groups, classes) in the data

Method: Supervised learning methods
* Multivariate Analysis of Variance (MANOVA)
* Soft Independent Modelling of Class Analogy (SIMCA)
* Linear Learning Machine (LLM)
* Discriminant Analysis (LDA, QDA)
* K-th Nearest Neighbours (KNN)
* UNEQ classification
Problem solved: quantitative demarcation of a priori classes, relationships between the class properties and variables

Method: Factorial methods
* Factor Analysis (FA)
* Principal Component Analysis (PCA)
* Canonical Correlation Analysis (CCA)
Problem solved: finding factors

Method: Correlation and regression analysis
* with direct variables
* with latent variables
Problem solved: quantitative description of the relationships between variables

There are two kinds of methods in multidimensional statistics: factorial methods, in which the data are projected onto a vector space while trying to lose as little information as possible, and classification methods, which try to cluster and classify the investigated objects. Factorial methods are aimed at projecting the original data set from a high-dimensional space onto a line, a plane, or a three-dimensional coordinate system. There are mathematical procedures that allow the rotation of the data space into all possible directions and stop this process when the best projection, i.e. optimal clustering of the data groups, has been found. At present, in the field of chemometrics, data projection is performed mainly by principal component analysis (PCA), factor analysis (FA) and, more rarely, other methods. The different methods are linked to different science areas; they also differ mathematically in the way the projection is computed. Principal component analysis attempts to determine a smaller set of synthetic variables that could explain the original variable set. Canonical correlation analysis tries to establish whether or not there are linear relationships between two sets of variables.
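As an illustration of such a projection, here is a minimal sketch, assuming scikit-learn and NumPy are available and using purely synthetic data; it projects ten measured variables onto the plane of the first two principal components:

```python
# A minimal sketch of projection by principal component analysis, assuming
# scikit-learn and NumPy are available; the synthetic 10-variable data are
# illustrative only.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                   # 100 objects, 10 variables
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)   # two correlated variables

# Project the 10-dimensional data onto the PC1/PC2 plane, the best
# two-dimensional projection in the least-squares sense.
pca = PCA(n_components=2)
scores = pca.fit_transform(X)             # PC1/PC2 coordinates of each object
print(pca.explained_variance_ratio_)      # fraction of variance per component
```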

The basic purpose behind correlation is to find out if two variables are mutually related. If the variables are related, regression then allows the use of the relationship in the prediction of one variable given a score on the other variable. Regression analysis involves identifying the relationship between a dependent variable and one or more independent variables. A model of the relationship is hypothesized, and estimates of the parameter values are used to develop an estimated regression equation. Various tests are then employed to determine if the model is satisfactory. If the model is deemed satisfactory, the estimated regression equation can be used to predict the value of the dependent variable given the values of the independent variables. In simple linear regression, the model used to describe the relationship between a single dependent variable y and a single independent variable x is shown in Equation 2-1:

$$y = a_0 + a_1 x + k \qquad (2\text{-}1)$$

where $a_0$ and $a_1$ are referred to as the model parameters and k is a probabilistic error term that accounts for the variability in y that cannot be explained by the linear relationship with x. If the error term were not present, the model would be deterministic; in that case, knowledge of the value of x would be sufficient to determine the value of y. Either a simple or a multiple regression model is initially posed as a hypothesis concerning the relationship among the dependent and independent variables. The least squares method is the most widely used procedure for developing estimates of the model parameters.

Correlation and regression analysis are related in the sense that both deal with the relationships among variables. The correlation coefficient is a measure of linear association between two variables. The values of the correlation coefficient are always between -1 and +1. A correlation coefficient of +1 indicates that two variables are perfectly related in a positive linear sense, a correlation coefficient of -1 indicates that two variables are perfectly related in a negative linear sense, and a correlation coefficient of 0 indicates that there is no linear relationship between the two variables. For simple linear regression, the sample correlation coefficient is the square root of the coefficient of determination, with the sign of the correlation coefficient being the same as the sign of $a_1$, the coefficient expressing the slope in the estimated regression equation. Neither regression nor correlation analyses can be interpreted as establishing cause-and-effect relationships; they can indicate only how, or to what extent, the variables are associated with each other. The correlation coefficient measures only the degree of linear association between two variables. Any conclusions about a cause-and-effect relationship must be based on the judgment of the analyst. Regression analysis attempts to determine a linear formula that can describe how some variables respond to changes in others. Logistic regression allows regression analysis to estimate and test the influence of covariates on a binary response. Non-linear regression analysis and artificial neural networks extend regression methods to non-linear multivariate models.
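The least squares estimation of the parameters of Equation 2-1 can be written out directly. The following is a minimal sketch assuming NumPy is available; the data are illustrative only:

```python
# A minimal sketch of least squares estimation of the parameters a0 and a1
# in Equation 2-1, assuming NumPy is available; the data are illustrative.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

# Closed-form least squares estimates for y = a0 + a1*x + k.
a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()
print(a0, a1)  # intercept near 0, slope near 2 for these data

# The residuals are the sample counterpart of the error term k.
residuals = y - (a0 + a1 * x)
print(residuals)
```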

2.2. Advantages and disadvantages of multivariate data analyses

Analyses of data are divided into two groups according to the number of considered variables, namely univariate and multivariate statistics. Univariate statistics considers only one dependent variable at a time, e.g. in the sample mean, t-test, or ANOVA. On the other hand, multivariate statistics considers more than one dependent variable at a time, namely a set of dependent variables in multidimensional space, and accounts for relationships among the dependent variables as well as the relationships between independent and dependent ones. A multivariate approach does not supersede univariate ones; it is best seen as a complement that allows reduction of redundancy. In addition, certain outliers are never revealed in a univariate context, because they are by their very nature multivariate. A curtailed comparison between univariate and multivariate statistics is shown in Table 2-2.

Table 2-2: Comparison between univariate and multivariate statistical methods

Univariate statistics
* Advantages: easy to use and monitor; only one chart has to be monitored
* Disadvantage: often misleading, because the correlation between parameters is ignored
* Monitoring: e.g. by Shewhart charts

Multivariate statistics
* Advantages: reduction of false alarms; correlation between parameters is considered
* Disadvantage: influence of single parameters cannot be observed easily
* Monitoring: summarization of multiple parameters into a single chart

Multivariate methods are based on multiple linear regression, which is a multiple-variable approach using the principles of simple linear regression. Multivariate methods have the advantage of bringing more information to bear on a specific outcome. They allow one to take into account the continuing relationships among several variables. This is especially valuable in observational studies, where total control is never possible. The specific advantages of multivariate studies are as follows:
- They resemble closely how the researcher thinks about the data.
- They allow easier visualisation and interpretation of the data.
- More data can be analyzed simultaneously, thereby providing greater statistical power.
- Regression models can give more insight into relationships between variables (the relationships between variables are better understood).
- The focus is on relationships among variables rather than on isolated individual factors.

2.3. Description and pre-treatment of data

Data preparation involves checking or logging the data in; checking the data for accuracy; entering the data into the computer; transforming the data; and developing and documenting a database structure that integrates the various measures.

Logging the data
In any research project you may have data coming from a number of different sources at different times: coded interview data, pre-test or post-test data, and observational data. In all but the simplest of studies, you need to set up a procedure for logging the information and keeping track of it until you are ready to do a comprehensive data analysis. Researchers differ in how they prefer to keep track of incoming data. In most cases, you will want to set up a database that enables you to assess at any time what data are already in and what are still outstanding. It is also critical that the data analyst retain the original data records for a reasonable period of time: returned surveys, field notes, test protocols, and so on. Most professional researchers will retain such records for at least 5-7 years. For important or expensive studies, the original data might be stored in a data archive. The data analyst should always be able to trace a result from a data analysis back to the original forms on which the data were collected. A database for logging incoming data is a critical component of good research record-keeping.

Checking the data for accuracy
As soon as data are received, you should screen them for accuracy. In some circumstances, doing this right away will allow you to go back to the sample to clarify any problems or errors. As part of this initial data screening it is important that the responses are readable, all important questions are answered, the responses are complete, and all relevant contextual information is included (e.g., date, time, place, researcher). In most social research, quality of measurement is a major issue. Assuring that the data collection process does not contribute inaccuracies will help assure the overall quality of subsequent analyses.

Developing a database structure
The database structure is the manner in which you intend to store the data for the study so that they can be accessed in subsequent data analyses. You might use the same structure you used for logging in the data or, in large complex studies, you might have one structure for logging data and another for storing them. There are generally two options for storing data on a computer: database programs and statistical programs. In every research project, it is very handy to generate a file that describes (codes) the data and indicates where and how they can be accessed. Minimally, the file should include the following items for each variable: variable name, variable description, variable format (number, date, text), instrument/method of collection, date collected, respondent or group, variable location (in the database), and notes. This information is an indispensable tool for the analysis team. Together with the database, it should provide comprehensive documentation that enables other researchers who might subsequently want to analyze the data to do so without any additional information.
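As an illustration only, such a data-description (codebook) file can be as simple as one record per variable. The sketch below is hypothetical: every field value is invented for the example and not taken from the thesis.

```python
# A minimal sketch of one codebook (data-description) entry holding the
# fields listed above; the variable and all its values are hypothetical.
codebook_entry = {
    "variable_name": "na_content",
    "variable_description": "Sodium content of a mineral water sample",
    "variable_format": "number",            # number, date, or text
    "instrument_method": "ion chromatography",
    "date_collected": "2007-05-14",
    "respondent_or_group": "mineral water sample set",
    "variable_location": "column 7 of the samples table",
    "notes": "reported in mg/L",
}
print(codebook_entry["variable_name"], "->", codebook_entry["notes"])
```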

Entering the data into the computer
There is a wide variety of ways to enter the data into the computer for analysis. Probably the easiest is to type the data in directly. In order to assure a high level of data accuracy, the analyst should use a procedure called double entry. However, double entry programs are not widely available and require some training. An alternative is to enter the data once and set up a procedure for checking them for accuracy. For instance, you might spot-check records on a random basis.

Data transformations
Once the data have been entered, it is almost always necessary to transform the raw data into variables that are usable in the analyses. There is a wide variety of transformations that you might perform. They include:
- Missing values: many analysis programs automatically treat blank values as missing. In others, either you must omit some measurements (which is not favourable, because you lose information) or you need to designate specific values to represent missing values. For instance, you might use a value of 99 to indicate that the item is missing. You need to check the specific program you are using to determine how to handle missing values.
- Item reversals: on scales and surveys, we sometimes use reversals to help reduce the possibility of a response set. When you analyze the data, you want all scores for scale items to be in the same direction, where high scores mean the same thing and low scores mean the same thing. In these cases, you have to reverse the ratings for some of the scale items.
- Scale totals: once you have transformed any individual scale items, you will often want to add or average across individual items to get a total score for the scale.
- Categories: for many variables you will want to collapse values into categories. For instance, you may want to collapse income estimates (in dollar amounts) into income ranges.
- Data cleaning: checks must be performed thoroughly and extensively in order to ensure consistency and treatment of missing responses. A consistency check is a part of the data cleaning process that identifies data that are out of range, logically inconsistent, or have extreme values.
- Casewise deletion: a method for handling missing responses in which cases or respondents with any missing responses are discarded from the analysis.
- Pairwise deletion: a method of handling missing values in which cases or respondents with missing values are not automatically discarded; rather, for each calculation only the cases or respondents with complete responses are considered.
- Weighting: a statistical adjustment to the data in which each case or respondent in the database is assigned a weight to reflect its importance relative to other cases or respondents.
- Variable respecification: the transformation of data to create new variables, or the modification of existing variables, so that they are more consistent with the objectives of the study.
- Scale transformation: a manipulation of scale values to ensure comparability with other scales or otherwise make the data suitable for analysis.
- Standardization: the process of correcting data to reduce them to the same scale by subtracting the sample mean and dividing by the standard deviation (missing-value recoding, casewise deletion and standardization are illustrated in the sketch after this list).
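The following minimal sketch, assuming pandas and NumPy are available and using an invented two-variable table, shows three of the transformations above: missing-value recoding, casewise deletion and standardization (autoscaling):

```python
# A minimal sketch of missing-value recoding, casewise deletion and
# standardization, assuming pandas and NumPy are available; the small
# table and the 99 missing-value code are illustrative only.
import numpy as np
import pandas as pd

raw = pd.DataFrame({"ph": [6.8, 7.2, 99.0, 6.5],
                    "conductivity": [250.0, 99.0, 310.0, 275.0]})

# Missing values: recode the designated value 99 as "not available".
data = raw.replace(99.0, np.nan)

# Casewise deletion: drop every row that has any missing response.
complete_cases = data.dropna()

# Standardization (autoscaling): subtract each column's sample mean and
# divide by its sample standard deviation.
standardized = (complete_cases - complete_cases.mean()) / complete_cases.std(ddof=1)
print(standardized)
```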

2.4. Similarity and dissimilarity of objects

In multivariate statistics, the definition of object/variable similarities and dissimilarities is very important and necessary. Similarities among objects can be expressed with the help of (a) distance measures, (b) association coefficients, and (c) correlation coefficients. A short distance corresponds to a high similarity level. Association and correlation coefficients should be read the other way round: a high value of the correlation coefficient means similarity, while a value of zero means the opposite. In the following text these definitions will be explained in more detail.

The selection of the distance measure is an important step in any clustering method, as it determines how the similarity of two elements is calculated. This will influence the shape of the clusters, as some elements may be close to one another according to one distance and further away according to another. For example, in a 2-dimensional space, the distance between the point (x=1, y=0) and the origin (x=0, y=0) is always 1 according to the usual norms, but the distance between the point (x=1, y=1) and the origin can be 2, √2 or 1 if you take respectively the 1-norm, 2-norm or infinity-norm distance. Commonly used distance functions are the Manhattan distance (also called city block distance), the Mahalanobis distance (corrects data for different scales and correlations in the variables), the Hamming distance (sometimes edit distance; measures the minimum number of substitutions required to change one member into another), and the Euclidean distance (also called distance "as the crow flies", or 2-norm distance).

The Euclidean distance or Euclidean metric is the "ordinary" distance between two points that one would measure with a ruler, which can be proven by repeated application of the Pythagorean theorem. The Euclidean distance between points $P = (p_1, p_2, \ldots, p_n)$ and $Q = (q_1, q_2, \ldots, q_n)$ in Euclidean n-space is defined by Equation 2-2:

$$d = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \qquad (2\text{-}2)$$

For 1-dimensional points $P = (p_x)$ and $Q = (q_x)$ and 2-dimensional points $P = (p_x, p_y)$ and $Q = (q_x, q_y)$, the distance is computed as shown by Equations 2-3 and 2-4, respectively:

$$d = \sqrt{(p_x - q_x)^2} = |p_x - q_x| \qquad (2\text{-}3)$$

where the absolute value signs are used since distance is normally considered to be an unsigned scalar value, and

$$d = \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2} \qquad (2\text{-}4)$$
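The distances discussed above are easy to verify numerically. The following minimal sketch, assuming NumPy is available, reproduces the 1-norm/2-norm/infinity-norm example for the point (1, 1) and implements Equation 2-2 directly:

```python
# A minimal sketch of the distance measures discussed above, assuming
# NumPy is available; it reproduces the (1, 1)-to-origin example and
# implements Equation 2-2 directly.
import numpy as np

p = np.array([1.0, 1.0])
origin = np.zeros(2)

print(np.linalg.norm(p - origin, ord=1))       # Manhattan distance: 2.0
print(np.linalg.norm(p - origin, ord=2))       # Euclidean distance: 1.414...
print(np.linalg.norm(p - origin, ord=np.inf))  # infinity-norm distance: 1.0

def euclidean(p, q):
    """Euclidean distance per Equation 2-2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))

print(euclidean([1, 1], [0, 0]))               # 1.4142..., same as above
```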

The most common distance measure in published studies is the Euclidean distance or the squared Euclidean distance. A statistical technique often used in information visualization for exploring similarities or dissimilarities in data is called multidimensional scaling (MDS).

An association coefficient is a number used to indicate the degree to which two variables or objects are related. Two basic types are co-variation measures and dissimilarity measures. Co-variation measures (such as the Pearson product-moment correlation, r) are based upon the product (multiplication) of the data values and indicate the extent to which the variables are associated (0 = none and 1 = perfect) and the direction in which the variables vary with each other (positive, where one variable increases with the other, or negative, where one variable decreases as the other increases). Association measures should be chosen by reference to the level of measurement of the variables involved, and most measures of association retain the same value if the data values are legitimately transformed in accordance with that level. Dissimilarity measures cover both similarity (or proximity) measures, where a high value means considerable likeness between the variables, and dissimilarity measures proper, where a high value denotes considerable difference. A measure of the interdependence of two random variables that ranges in value from -1 to +1, indicating perfect negative correlation at -1, absence of correlation at zero, and perfect positive correlation at +1, is called the coefficient of correlation.

2.5. Basic and multivariate statistical methods

Statistical methods can be used to summarize or describe a collection of data; this is called descriptive statistics. In addition, patterns in the data may be modelled in a way that accounts for randomness and uncertainty in the observations, and then used to draw inferences about the process or population being studied; this is called inferential statistics. Both descriptive and inferential statistics comprise applied statistics. For practical reasons, rather than compiling data about an entire population, one usually studies a chosen subset of the population. After the data are collected, they are subjected to statistical analysis, which serves two related purposes: description and inference. Descriptive statistics can be used to summarize the data, either numerically or graphically, to describe the sample. Basic examples of numerical descriptors include the mean and standard deviation. Graphical summarizations include various kinds of charts and graphs. Inferential statistics is used to model patterns in the data, accounting for randomness and drawing inferences about the larger population. These inferences may take the form of answers to yes/no questions (hypothesis testing), estimates of numerical characteristics (estimation), descriptions of association (correlation), or modelling of relationships (regression). Other modelling techniques include ANOVA, time series, and data mining. Some well-known statistical tests and procedures used in chemometrics for research observations are explained below.

2.5.1. Correlation analysis

In probability theory and statistics, correlation, measured by the correlation coefficient, indicates the strength and direction of a linear relationship between two random variables. In general statistical usage, correlation or co-relation refers to the departure of two variables from independence. In this broad sense there are several coefficients measuring the degree of correlation, adapted to the nature of the data; a number of different coefficients are used in different situations. The best known is the Pearson product-moment correlation coefficient, which is obtained by dividing the covariance of the two variables by the product of their standard deviations. The correlation coefficient $\rho_{X,Y}$ between two random variables X and Y with expected values $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$ is defined as shown in Equations 2-5 and 2-6, where E is the expected value operator and cov means covariance:

$$\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E\big((X-\mu_X)(Y-\mu_Y)\big)}{\sigma_X \sigma_Y} \qquad (2\text{-}5)$$

Since $\mu_X = E(X)$ and $\sigma_X^2 = E(X^2) - E^2(X)$, and likewise for Y, we may also write

$$\rho_{X,Y} = \frac{E(XY) - E(X)\,E(Y)}{\sqrt{E(X^2) - E^2(X)}\,\sqrt{E(Y^2) - E^2(Y)}} \qquad (2\text{-}6)$$

The correlation is defined only if both of the standard deviations are finite and both of them are nonzero. The correlation coefficient cannot exceed 1 in absolute value. The correlation is 1 in the case of an increasing linear relationship, -1 in the case of a decreasing linear relationship, and some value in between in all other cases, indicating the degree of linear dependence between the variables. The closer the coefficient is to either -1 or 1, the stronger the correlation between the variables. If the variables are independent, then the correlation is 0; but the converse is not true, because the correlation coefficient detects only linear dependences between two variables. If we have a series of n measurements of X and Y written as $x_i$ and $y_i$, where i = 1, 2, ..., n, then the Pearson product-moment correlation coefficient can be used to estimate the correlation of X and Y. The Pearson coefficient, also known as the "sample correlation coefficient", is the best estimate of the correlation of X and Y. It is written as:

$$r_{x,y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y} = \frac{\sum_i x_i y_i - n\,\bar{x}\bar{y}}{(n-1)\, s_x s_y} = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{\sqrt{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2}\,\sqrt{n \sum_i y_i^2 - \left(\sum_i y_i\right)^2}} \qquad (2\text{-}7)$$

where $\bar{x}$ and $\bar{y}$ are the sample means of X and Y, $s_x$ and $s_y$ are the sample standard deviations of X and Y, and the sums run from i = 1 to n. As is true for the population correlation, the absolute value of the sample correlation must be less than or equal to 1.
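Equation 2-7 can be checked against a library implementation. The following is a minimal sketch assuming NumPy is available; the data are illustrative only:

```python
# A minimal sketch checking Equation 2-7 against NumPy's built-in estimate;
# NumPy is assumed available and the data are illustrative only.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

# Equation 2-7, first form: centred cross-products over (n-1) * s_x * s_y.
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / (
    (n - 1) * x.std(ddof=1) * y.std(ddof=1))

# Library estimate of the same Pearson coefficient.
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)  # both approximately 0.999 for these data
```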

Also, as is true for the population correlation, the absolute value of the sample correlation must be less than or equal to 1. The square of the sample correlation coefficient, which is also known as the coefficient of determination, is the fraction of the variance in y_i that is accounted for by a linear fit of x_i to y_i. This is shown in Equation 2-8:

r²_x,y = 1 − s²_y·x / s²_y    (2-8)

where s²_y·x is the squared error of a linear regression of y_i on x_i by the equation y = a + bx, expressed as shown in Equation 2-9, and s²_y is simply the variance of y (Equation 2-10):

s²_y·x = (1/(n − 1)) Σ_{i=1..n} (y_i − a − b x_i)²    (2-9)

s²_y = (1/(n − 1)) Σ_{i=1..n} (y_i − ȳ)²    (2-10)

Several authors have offered guidelines for the interpretation of a correlation coefficient (see Table 2-3) [Cohen J. 1988]. As Cohen himself has observed, however, all such criteria are in some ways arbitrary and should not be applied too strictly, because the interpretation of the correlation coefficient depends on the context and purposes. A correlation of 0.9 may be very low if one is verifying a physical law using high-quality instruments, but may be regarded as very high in the social sciences, where there may be a greater contribution from complicating factors. It is important to remember that "large" and "small" should not be taken as synonyms for "good" and "bad" when judging that a correlation is of a certain size.

Table 2-3: Interpretation of the size of a correlation

Correlation   Negative          Positive
Small         −0.3 to −0.1      0.1 to 0.3
Medium        −0.5 to −0.3      0.3 to 0.5
Large         −1.0 to −0.5      0.5 to 1.0

Pearson's correlation coefficient is a parametric statistic. Parametric statistics assume that the population fits a parameterized distribution (most typically the normal distribution). When the distributions are not normal, this kind of statistics may be less useful than non-parametric correlation methods.

Such non-parametric methods include Chi-square, point bi-serial correlation, Spearman's ρ, Kendall's τ, and Goodman and Kruskal's λ. They are a little less powerful than parametric methods if the assumptions underlying the latter are met, but are less likely to give distorted results when the assumptions fail.

The results of correlation analysis can be presented in a correlation table. A correlation matrix of n random variables X_1, ..., X_n is the n × n matrix whose i, j entry is corr(X_i, X_j). If the measures of correlation used are product-moment coefficients, the correlation matrix is the same as the covariance matrix of the standardized random variables X_i/SD(X_i) for i = 1, ..., n. Consequently, it is necessarily a positive semi-definite matrix. The correlation matrix is symmetric because the correlation between X_i and X_j is the same as the correlation between X_j and X_i.

Correlation and linearity

While the Pearson correlation indicates the strength of a linear relationship between two variables, its value alone may not be sufficient to evaluate this relationship, especially in the case where the assumption of normality is incorrect (see Figure 2-1). Figure 2-1 shows scatterplots of Anscombe's quartet, a set of four different pairs of variables created by Anscombe [Anscombe F.J. 1973]. The four y variables have the same mean (7.5), standard deviation (4.12), correlation (0.81) and regression line (y = 3 + 0.5x).

Figure 2-1: Four sets of data with the same correlation coefficient of 0.81.

However, as can be seen on the plots, the distribution of the variables is very different.

The first one (top left) seems to be distributed normally and corresponds to what one would expect when considering two correlated variables following the assumption of normality. The second one (top right) is not distributed normally; while an obvious relationship between the two variables can be observed, it is not linear, and the Pearson correlation coefficient is not relevant. In the third case (bottom left), the linear relationship is perfect, except for one outlier which exerts enough influence to lower the correlation coefficient from 1 to 0.81. Finally, the fourth example (bottom right) shows another case where one outlier is enough to produce a high correlation coefficient, even though the relationship between the two variables is not linear. These examples indicate that the correlation coefficient, as a summary statistic, cannot replace the individual examination of the data.

(M)ANOVA. The purpose of analysis of variance (ANOVA) is to test differences in means (for groups or variables) for statistical significance. This is accomplished by analyzing the variance, that is, by partitioning the total variance into the component that is due to true random error (i.e., within-group SS) and the components that are due to differences between means. These latter variance components are then tested for statistical significance and, if significant, we reject the null hypothesis of no differences between means and accept the alternative hypothesis that the means (in the population) are different from each other. The variables that are measured are called dependent variables. The variables that are manipulated or controlled (e.g., a teaching method or some other criterion used to divide observations into groups that are compared) are called factors or independent variables.

ANOVA is a flexible and powerful technique that can be applied to many complex research issues. In ANOVA we can test each factor while controlling for all others; this is actually the reason why ANOVA is more statistically powerful (i.e., we need fewer observations to find a significant effect) than the simple t-test. Another advantage of ANOVA over simple t-tests is that ANOVA allows us to detect interaction effects between variables and, therefore, to test more complex hypotheses about reality. In the case of more than one dependent variable we can perform a multivariate analysis of variance (MANOVA). Instead of a univariate F value, we obtain a multivariate F value (Wilks' lambda) based on a comparison of the error variance/covariance matrix and the effect variance/covariance matrix. The "covariance" here is included because the two measures are probably correlated and we must take this correlation into account when performing the significance test. If we were to take the same measure twice, we would really not learn anything new. If we take a correlated measure, we gain some new information, but the new variable will also contain redundant information that is expressed in the covariance between the variables. If the overall multivariate test is significant, we conclude that the respective effect is significant.
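A minimal Python sketch of a one-way ANOVA with hypothetical group data; the MANOVA case would be handled analogously with a multivariate implementation (e.g., in statsmodels):

import numpy as np
from scipy.stats import f_oneway

# Hypothetical measurements of one dependent variable in three groups
group1 = np.array([4.1, 3.9, 4.3, 4.0])
group2 = np.array([4.8, 5.1, 4.9, 5.2])
group3 = np.array([4.2, 4.4, 4.1, 4.5])

F, p = f_oneway(group1, group2, group3)   # partitions variance between/within groups
if p < 0.05:
    print(f"F = {F:.2f}, p = {p:.4f}: reject the null hypothesis of equal means")
else:
    print(f"F = {F:.2f}, p = {p:.4f}: no significant difference between means")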

Cluster analysis. The term cluster analysis (first used by Tryon, 1939) covers a number of different algorithms and methods for grouping objects of similar kind into respective categories. A general question facing researchers in many areas of inquiry is how to organize the observed data into meaningful structures, that is, to develop taxonomies. In other words, cluster analysis is an exploratory data analysis tool which aims at sorting different objects into groups in such a way that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise. Cluster analysis can thus be used to discover structures in data without providing an explanation or interpretation: it simply discovers structures without explaining why they exist.

The above refers to clustering algorithms and says nothing about statistical significance testing. In fact, cluster analysis is not so much a typical statistical test as a "collection" of different algorithms that put objects into clusters according to well-defined similarity rules. The point here is that, unlike many other statistical procedures, cluster analysis methods are mostly used when we do not have any a priori hypotheses but are still in the exploratory phase of our research. In a sense, cluster analysis finds the "most significant solution possible". Therefore, statistical significance testing is really not appropriate here, even in cases when p-levels are reported (as in k-means clustering). Clustering techniques have been applied to a wide variety of research problems. In general, whenever one needs to classify a large amount of information into manageable, meaningful piles, cluster analysis is of great utility.

The goal of the joining or tree clustering algorithm is to join objects together into successively larger clusters, using some measure of similarity or distance. A typical result of this type of clustering is the hierarchical tree (dendrogram): more and more objects are linked together, and larger and larger clusters of increasingly dissimilar elements are aggregated (amalgamated) until, finally, in the last step, all objects are joined together. In these plots, the horizontal axis denotes the linkage distance. Thus, for each node in the graph (where a new cluster is formed), one can read off the criterion distance at which the respective elements were linked together into a new single cluster. When the data contain a clear "structure" in terms of clusters of objects that are similar to each other, this structure will often be reflected in the hierarchical tree as distinct branches. As the result of a successful analysis with the joining method, one is able to detect clusters (branches) and interpret them.

The joining or tree clustering method uses the dissimilarities (similarities) or the distances between objects when forming the clusters. Similarities are a set of rules that serve as criteria for grouping or separating items. These similarities (distances) can be based on a single dimension or multiple dimensions, with each dimension representing a rule or condition for grouping objects. The most straightforward way of computing distances between objects in a multi-dimensional space is to compute Euclidean distances. In a two- or three-dimensional space this measure is the actual geometric distance between the objects in the space (i.e., as if measured with a ruler).
However, the joining algorithm does not "care" whether the distances that are "fed" to it are actual real distances or some other derived measure of distance that is more meaningful to the researcher, and it is up to the researcher to select the right method for his/her specific application. Below, the different distance measures used in cluster analysis are introduced, followed by a short computational sketch.

Euclidean distance
This is probably the most commonly chosen type of distance. It is simply the geometric distance in the multidimensional space, computed as shown in Equations 2-2, 2-3 and 2-4. Euclidean (and squared Euclidean) distances are usually computed from raw data, and not from standardized data. This method has certain advantages, e.g., the distance between any two objects is not affected by the addition of new objects to the analysis, which may be outliers. However, the distances can be greatly affected by differences in scale among the dimensions from which the distances are computed. For example, if one of the dimensions denotes a measured length in centimetres and it is then converted to millimetres (by multiplying the values by 10), the resulting Euclidean or squared Euclidean distances (computed from multiple dimensions) can be greatly affected (i.e., biased by those dimensions which have a larger scale), and consequently, the results of cluster analyses may be very different. Generally, it is good practice to transform the dimensions so that they have similar scales.

Squared Euclidean distance
One may want to square the standard Euclidean distance in order to place progressively greater weight on objects that are further apart. This distance is computed as shown in Equation 2-11:

distance(x,y) = Σ_i (x_i − y_i)²    (2-11)

City-block (Manhattan) distance
This distance is simply the sum of the absolute differences across dimensions. In most cases, this distance measure yields results similar to the simple Euclidean distance. However, in this measure, the effect of single large differences (outliers) is dampened (since they are not squared). The city-block distance is computed as shown in Equation 2-12:

distance(x,y) = Σ_i |x_i − y_i|    (2-12)

Chebyshev distance
This distance measure may be appropriate in cases when one wants to define two objects as "different" if they differ on any one of the dimensions. The Chebyshev distance is computed as shown in Equation 2-13:

distance(x,y) = max_i |x_i − y_i|    (2-13)

Power distance
Sometimes one may want to increase or decrease the progressive weight that is placed on dimensions on which the respective objects are very different. This can be accomplished via the power distance.

The power distance is computed as shown in Equation 2-14:

distance(x,y) = ( Σ_i |x_i − y_i|^p )^(1/r)    (2-14)

where r and p are user-defined parameters. Parameter p controls the progressive weight that is placed on differences on individual dimensions; parameter r controls the progressive weight that is placed on larger differences between objects. If r and p are both equal to 2, then this distance is equal to the Euclidean distance.

Percent disagreement
This measure is particularly useful if the data for the dimensions included in the analysis are categorical in nature. This distance is computed as shown in Equation 2-15:

distance(x,y) = (number of x_i ≠ y_i) / i    (2-15)
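The distance measures above are available in SciPy; a minimal Python sketch with a hypothetical 4×3 data matrix (4 objects, 3 dimensions):

import numpy as np
from scipy.spatial.distance import pdist

# Hypothetical data matrix: 4 objects described by 3 variables
X = np.array([[1.0, 2.0, 0.5],
              [1.2, 1.8, 0.7],
              [3.5, 0.9, 2.1],
              [3.7, 1.1, 2.0]])

for metric in ("euclidean", "sqeuclidean", "cityblock", "chebyshev"):
    d = pdist(X, metric=metric)          # condensed vector of pairwise distances
    print(metric, np.round(d, 2))

# The power distance (Eq. 2-14) with p = r is the Minkowski distance
d_power = pdist(X, metric="minkowski", p=3)
print("power (p = r = 3)", np.round(d_power, 2))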

Formation of new clusters
At the first step, when each object represents its own cluster, the distances between those objects are defined by the chosen distance measure. However, once several objects have been linked together, the distances between the new clusters have to be determined. In other words, a linkage rule is needed to determine when two clusters are sufficiently similar to be linked together. There are various possibilities: for example, two clusters could be linked together when any two objects in the two clusters are closer together than the respective linkage distance. Put another way, we use the "nearest neighbours" across clusters to determine the distances between clusters; this method is called single linkage. This rule produces "stringy" types of clusters, that is, clusters "chained together" by only single objects that happen to be close together. Alternatively, we may use the neighbours across clusters that are furthest away from each other; this method is called complete linkage. Numerous other linkage rules of this kind have been proposed.

Single linkage (nearest neighbour)
As described above, in this method the distance between two clusters is determined by the distance of the two closest objects (nearest neighbours) in the different clusters. This rule will, in a sense, string objects together to form clusters, and the resulting clusters tend to represent long "chains".

Complete linkage (furthest neighbour)
In this method, the distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbours"). This method usually performs quite well when the objects actually form naturally distinct "clumps"; if the clusters tend to be elongated or of a "chain" type nature, then this method is inappropriate.

Unweighted pair-group average
In this method, the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters. This method is also very efficient when the objects form natural distinct groups; however, it performs equally well with elongated, "chain" type clusters.

Weighted pair-group average
This method is identical to the unweighted pair-group average method, except that in the computations the size of the respective clusters (i.e., the number of objects contained in them) is used as a weight. Thus, this method (rather than the previous one) should be used when the cluster sizes are suspected to be greatly uneven.

Unweighted pair-group centroid
The centroid of a cluster is the average point in the multidimensional space defined by the dimensions; in a sense, it is the centre of gravity of the respective cluster. In this method, the distance between two clusters is determined as the distance between their centroids.

Weighted pair-group centroid (median)
This method is identical to the previous one, except that weighting is introduced into the computations to take into consideration the differences in cluster sizes (i.e., the number of objects contained in them). Thus, when there are (or one suspects there to be) considerable differences in cluster sizes, this method is preferable to the previous one.

Ward's method
This method is distinct from all other methods because it uses an analysis of variance approach to evaluate the distances between clusters. In short, this method attempts to minimize the sum of squares (SS) of any two (hypothetical) clusters that can be formed at each step. In general, this method is regarded as very efficient; however, it tends to create clusters of small size.
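A minimal Python sketch of hierarchical (joining/tree) clustering with three of the linkage rules above, on the same kind of hypothetical data matrix as in the previous sketch:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.array([[1.0, 2.0, 0.5],
              [1.2, 1.8, 0.7],
              [3.5, 0.9, 2.1],
              [3.7, 1.1, 2.0]])

d = pdist(X, metric="euclidean")

Z_single   = linkage(d, method="single")    # nearest-neighbour rule
Z_complete = linkage(d, method="complete")  # furthest-neighbour rule
Z_ward     = linkage(X, method="ward")      # ANOVA-based rule (requires Euclidean data)

# Cut the tree into two clusters and report the memberships
print(fcluster(Z_ward, t=2, criterion="maxclust"))
# scipy.cluster.hierarchy.dendrogram(Z_ward) would draw the hierarchical tree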

Principal component analysis. Principal component analysis is appropriate when one has obtained measures on a number of observed variables and wishes to develop a smaller number of artificial variables (called principal components) that will account for most of the variance in the observed variables. The principal components may then be used as predictor or criterion variables in subsequent analyses. Principal component analysis is a variable reduction procedure. It is useful when one has obtained data on a number of variables (possibly a large number) and believes that there is some redundancy in those variables. In this case, redundancy means that some of the variables are correlated with one another, possibly because they are measuring the same construct. Because of this redundancy, it should be possible to reduce the observed variables to a smaller number of principal components (artificial variables) that will account for most of the variance in the observed variables.

A principal component can be defined as a linear combination of optimally weighted observed variables. In order to understand the meaning of this definition, it is necessary to first describe how subject scores on a principal component are computed. In the course of performing a principal component analysis, it is possible to calculate a score for each subject on a given principal component. The subject's actual scores are optimally weighted and then summed to compute their scores on a given component. Equation 2-16 is the general form of the formula used to compute scores on the first component extracted (created) in a principal component analysis:

C_1 = b_11 X_1 + b_12 X_2 + b_13 X_3 + ... + b_1p X_p    (2-16)

where C_1 is the subject's score on principal component 1 (the first component extracted), b_1p is the regression coefficient (or weight) for observed variable p, as used in creating principal component 1, and X_p is the subject's score on observed variable p. A different equation, with different regression weights, would be used to compute subject scores on component 2 (Equation 2-17):

C_2 = b_21 X_1 + b_22 X_2 + b_23 X_3 + ... + b_2p X_p    (2-17)

To determine the regression weights from the preceding equations, a special type of equation called an eigenequation is used. The weights produced by this eigenequation are optimal in the sense that, for a given set of data, no other set of weights could produce a set of components that are more successful in accounting for variance in the observed variables. The weights are created so as to satisfy a principle of least squares similar (but not identical) to the principle of least squares used in multiple regression. So, a principal component is defined as a linear combination of optimally weighted observed variables. The words "linear combination" refer to the fact that scores on a component are created by adding together scores on the observed variables being analyzed. "Optimally weighted" refers to the fact that the observed variables are weighted in such a way that the resulting components account for a maximal amount of variance in the data set.

The number of components extracted in the principal component analysis is equal to the number of observed variables being analyzed. However, in most analyses, only the first few components account for meaningful amounts of variance, so only these first few components are retained, interpreted, and used in subsequent analyses. The first component extracted in the principal component analysis accounts for the maximal amount of total variance in the observed variables. Under typical conditions, this means that the first component will be correlated with at least some of the observed variables; it may be correlated with many. The second component extracted has two important characteristics. First, this component accounts for a maximal amount of variance in the data set that was not accounted for by the first component. Again under typical conditions, this means that the second component will be correlated with some of the observed variables that did not display strong correlations with component 1. The second characteristic is that this component will be uncorrelated with the first component: if you were to compute the correlation between components 1 and 2, that correlation would be zero. The remaining components extracted in the analysis display the same two characteristics: each component accounts for a maximal amount of variance in the observed variables that was not accounted for by the preceding components and is uncorrelated with all of the preceding components. The principal component analysis proceeds in this fashion, with each new component accounting for progressively smaller amounts of variance; this is why only the first few components are usually retained and interpreted. When the analysis is complete, the resulting components display varying degrees of correlation with the observed variables but are completely uncorrelated with one another.

To understand the meaning of total variance as used in the principal component analysis, it is worth mentioning that the observed variables are standardized in the course of the analysis. This means that each variable is transformed so that it has zero mean and unit variance. The total variance in the data set is then simply the sum of the variances of the observed variables. Because they have been standardized to have a variance of one, each observed variable contributes one unit of variance to the total variance in the data set. Because of this, the total variance in the principal component analysis will always be equal to the number of observed variables being studied.
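A minimal Python sketch of principal component analysis on a hypothetical standardized data matrix; scikit-learn solves the eigenequation and computes the optimal weights internally:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical data: 10 samples described by 4 variables with built-in redundancy
rng = np.random.default_rng(0)
base = rng.normal(size=(10, 2))
X = np.hstack([base, base + 0.1 * rng.normal(size=(10, 2))])

X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance per variable

pca = PCA()
scores = pca.fit_transform(X_std)           # component scores C_1, C_2, ... per sample

# Each ratio is the fraction of the total variance explained by one component;
# with standardized data the total variance equals the number of variables.
print(pca.explained_variance_ratio_.round(3))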

Canonical correlations. A canonical correlation is the correlation of two canonical (latent) variables, one representing a set of independent variables, the other a set of dependent variables. It must be stressed that such a designation of dependent and independent variables is only formal. Each set may be considered a latent variable based on the measured indicator variables in its set. The canonical correlation is optimized such that the linear correlation between the two latent variables is maximized. The purpose of canonical correlation is to explain the relation of the two sets of variables, not to model the individual variables. For each canonical variate we can also assess how strongly it is related to the measured variables in its own set, or to the set for the other canonical variate. Wilks' lambda is commonly used to test the significance of a canonical correlation. Some key concepts and terms of canonical correlation analysis are explained below.

Canonical variable or variate
A canonical variable, also called a variate, is a linear combination of a set of original variables in which the within-set correlation has been controlled (that is, the variance of each variable accounted for by other variables in the set has been removed). It is a form of latent variable. There are two canonical variables per canonical correlation (function): one is the dependent canonical variable, while the one for the independents may be called the covariate canonical variable.

Canonical correlation
The canonical correlation, also called a characteristic root, is a form of correlation relating two sets of variables. As with factor analysis, there may be more than one significant dimension (more than one canonical correlation), each representing an orthogonally separate pattern of relationships between the two latent variables.

The maximum number of canonical correlations between two sets of variables is the number of variables in the smaller set. The first canonical correlation is always the one which explains most of the relationship. Some researchers, when reporting canonical correlation, report just the first canonical correlation, but it is recommended that all meaningful and interpretable canonical correlations are reported.

Canonical weight
Also called the canonical function coefficient or the canonical coefficient. The standardized canonical weights are used to assess the relative importance of the contributions of individual variables to a given canonical correlation. The canonical coefficients are the standardized weights in the linear equation of variables which creates the canonical variables. The ratio of canonical weights is the ratio of the contributions of the variables to the given canonical correlation, controlling for the other variables in the equation. There is one canonical coefficient for each original variable in each of the two sets of variables, for each canonical correlation. Thus, for a dependent set of five variables and three canonical correlations (functions), there will be 15 canonical coefficients in three sets of five coefficients.

Canonical scores
These are the values on a canonical variable for a given case, based on the canonical coefficients for that variable. The canonical coefficients are multiplied by the standardized scores of the cases and summed to yield the canonical scores for each case in the analysis.

A canonical correlation plot
The canonical correlation plot shows one (usually the first) canonical correlation. The X axis is the covariate canonical variable (the latent variable for the set of independents). The Y axis is the canonical variable representing the dependent variables. The scales on the axes are in standardized units, with the 0.0 location in the centre of the plot. The points are the canonical scores of each case based on the case's scores on the two canonical variables. A regression line shows the scatter of points. When the canonical correlation is high, the points will form two clusters at different points on the regression line.

Outliers
Cases outside the two clusters in a canonical correlation plot are outliers. Canonical correlation plots are a useful method of identifying outliers or exceptional cases which differ from other cases in not sharing the same pattern of correlation among the two sets of variables in the study.

Significance tests
Wilks' lambda is used to test the significance of the first canonical correlation. If p ≤ 0.05, the two sets of variables are significantly associated by canonical correlation. The degrees of freedom equal P × Q, where P is the number of variables in variable set 1 and Q is the number of variables in variable set 2. This test establishes the significance of the first canonical correlation, but not necessarily of the second; it is also called the greatest characteristic root test. The likelihood ratio test is a significance test of all sources (not just the first canonical correlation) of linear relationship between the two canonical variables. It is sometimes wrongly used as a test of the significance of the first or another single canonical correlation in a set of such functions.

Assumptions
Linearity of relationships is assumed: in its usual form, canonical correlation analysis is performed on the correlation or variance-covariance matrices, which reflect linear relationships. One can, however, insert exponentiated or otherwise nonlinearly transformed variables into either measured variable set. Minimal measurement error is assumed, since low reliability attenuates the correlation coefficient. Canonical correlation can also be quite sensitive to missing data, and outliers can substantially affect the canonical correlation coefficients, particularly if the sample size is not very large.
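A minimal Python sketch of canonical correlation analysis with two hypothetical variable sets; each canonical correlation is the correlation between one pair of canonical scores:

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(1)
latent = rng.normal(size=(30, 1))
X = np.hstack([latent + 0.3 * rng.normal(size=(30, 1)) for _ in range(3)])  # "independent" set
Y = np.hstack([latent + 0.3 * rng.normal(size=(30, 1)) for _ in range(2)])  # "dependent" set

cca = CCA(n_components=2)       # at most min(3, 2) = 2 canonical correlations
X_c, Y_c = cca.fit_transform(X, Y)

# Canonical correlations: correlation of each pair of canonical variates
for k in range(2):
    r = np.corrcoef(X_c[:, k], Y_c[:, k])[0, 1]
    print(f"canonical correlation {k + 1}: {r:.3f}")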

Discriminant analysis. The purpose of discriminant analysis is to classify objects (people, customers, things, etc.) into two or more groups based on a set of features that describe the objects (e.g. gender, age, income, weight, preference score, etc.). In general, based on the observations made on an object, the object is assigned to one of a number of predetermined groups. The groups are known or predetermined and do not have any order (i.e. the discrimination criterion is a nominal categorical variable). When performing discriminant analysis, two purposes are followed: the first one is feature selection and the second one is classification. A model is built on a set of observations for which the classes are known; this set of observations is usually referred to as the training set. Based on the training set, the technique constructs a set of linear functions of the predictors, known as discriminant functions (see Equation 2-18), which are used to predict the class of a new observation with an unknown class. For a k-class problem, k − 1 discriminant functions are constructed.

D = b_1 x_1 + b_2 x_2 + ... + b_n x_n + c    (2-18)

where the b's are discriminant coefficients, the x's are the input variables or predictors, and c is a constant.

Multivariate discriminant analysis (MDA) tries to find the combination of variables that best predicts the category or group to which a case belongs. The group identification must be known for each case used in the analysis. The combination of predictor variables is called a classification function, and this function can then be used to classify new cases whose group membership is unknown. The MDA considers only complete cases (no missing nor out-of-range data), and the presence of outliers can severely affect the analysis. Two variants of the MDA analysis are mostly considered, namely linear and quadratic discriminant analysis. The reason for this diversity is that discriminant methods are very much affected by the nature of the within-group covariance matrices. In linear discriminant analysis, the covariance matrices are assumed equal for all groups. In quadratic discriminant analysis, the assumption of homogeneity of the within-group covariance matrices is dropped, although the distributional assumption of multivariate normality is retained.
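A minimal Python sketch of linear and quadratic discriminant analysis on hypothetical two-class training data:

import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(2)
# Hypothetical training set: two groups of samples described by 3 predictors
X_train = np.vstack([rng.normal(0.0, 1.0, size=(20, 3)),
                     rng.normal(2.0, 1.0, size=(20, 3))])
y_train = np.array([0] * 20 + [1] * 20)   # known class labels

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)    # equal covariances assumed
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train) # per-group covariances

x_new = np.array([[1.1, 0.9, 1.0]])       # an unclassified object
print("LDA class:", lda.predict(x_new), " QDA class:", qda.predict(x_new))
print("discriminant coefficients b:", lda.coef_, " constant c:", lda.intercept_)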

Advanced regression analysis, logistic regression. Regression analysis involves finding the line of best fit through a series of points. The calculation uses the least squares method, and the outcome is a regression equation, i.e., the equation that shows the relation between the inputs and the output created by the regression analysis. Regression analysis should always be carried out in conjunction with correlation analysis to assess the goodness of fit. Types of regression analysis include linear regression, logistic regression (with a probability as the output, described in detail below) and multiple linear regression.

Linear regression describes a linear relationship with one input and one output. It fits the data to the linear regression equation

y = b_0 + b_1 x    (2-19)

The parameters can be found from the following equations:

b_1 = S_xy / S_xx    (2-20)

b_0 = ȳ − b_1 x̄    (2-21)

where

S_xx = Σ_{i=1..n} (x_i − x̄)²    (2-22)

S_xy = Σ_{i=1..n} (x_i − x̄)(y_i − ȳ)    (2-23)
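A minimal Python sketch of Equations 2-19 to 2-23 on hypothetical data:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Equations 2-22 and 2-23: sums of squares and cross-products
S_xx = np.sum((x - x.mean())**2)
S_xy = np.sum((x - x.mean()) * (y - y.mean()))

b1 = S_xy / S_xx                # slope, Eq. 2-20
b0 = y.mean() - b1 * x.mean()   # intercept, Eq. 2-21

print(f"y = {b0:.3f} + {b1:.3f} x")  # fitted regression equation, Eq. 2-19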

Multiple linear regression. Strictly speaking, multiple linear regression is a form of multivariate analysis, although it is not usually referred to as such. It is a type of linear regression that relates the response Y of a process to several inputs x_1, x_2, x_3. The regression equation is of the form

y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3    (2-24)

In statistics, logistic regression is a model used for prediction of the probability of occurrence of an event by fitting data to a logistic curve. A graph of the function is shown in Figure 2-2. Logistic regression makes use of several predictor variables that may be either numerical or categorical. The method is used extensively in the medical and social sciences. Other names for logistic regression used in various application areas include logistic model, logit model, and maximum-entropy classifier. Logistic regression is one of a class of models known as generalized linear models, which were formulated by John Nelder and Robert Wedderburn as a way of unifying various other statistical models, including linear regression, logistic regression and Poisson regression, under one framework.

Figure 2-2: The logistic function, with z on the horizontal axis and f(z) on the vertical axis.

The logistic function is expressed by Equation 2-25. The "input" is z and the "output" is f(z). The logistic function is useful because as an input it can take any value from negative infinity to positive infinity, whereas the output is confined to values between 0 and 1.

The variable z represents the exposure to some set of risk factors, while f(z) represents the probability of a particular outcome given that set of risk factors:

f(z) = 1 / (1 + e^(−z))    (2-25)

The variable z is a measure of the total contribution of all the risk factors used in the model and is known as the logit. It is usually defined as

z = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3 + ... + β_k x_k    (2-26)

where β_0 is called the "intercept" and β_1, β_2, β_3, and so on are called the "regression coefficients" of x_1, x_2, x_3, respectively. The intercept is the value of z when the values of all risk factors are zero (i.e., the value of z for someone with no risk factors). Each of the regression coefficients describes the size of the contribution of its risk factor. A positive regression coefficient means that the risk factor increases the probability of the outcome, while a negative regression coefficient means that the risk factor decreases the probability of that outcome; a large regression coefficient means that the risk factor strongly influences the probability of that outcome, while a near-zero regression coefficient means that the risk factor has little influence on it. Logistic regression is thus a useful way of describing the relationship between one or more risk factors (e.g., age, gender, etc.) and an outcome such as death (which takes only two possible values: dead or not dead).
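A minimal Python sketch of Equations 2-25 and 2-26 with hypothetical coefficients and risk-factor values:

import numpy as np

def logistic(z):
    """Equation 2-25: maps the logit z to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept and regression coefficients (Eq. 2-26)
beta = np.array([-4.0, 0.05, 1.2])   # beta_0, beta_1, beta_2
x = np.array([55.0, 1.0])            # e.g., age = 55, gender coded as 1

z = beta[0] + np.dot(beta[1:], x)    # total contribution of the risk factors
print(f"logit z = {z:.2f}, predicted probability f(z) = {logistic(z):.3f}")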

Chapter III

3. Aims

The gathered data were processed with the use of different chemometric tools: statistics (description and prediction of the sample or measurement properties and populations), visualisation (representation and projection of complex multidimensional data into a 2-D space), partition (assortment and definition of learning, test and control samples), optimisation (assortment and definition of the most appropriate conditions, variables and properties at given boundary requirements), modelling (quantitative prediction of the object properties), and classification (forecasting the category of a sample).

The Thesis covers different fields of research: (1) research on food products, (2) research in environmental chemistry, (3) research in clinical chemistry considering oncology patients undergoing different laboratory tests, and (4) the contemporary use of chemometrics, statistics and informatics. Using chemometric techniques, the following tasks were realized: (a) characterisation and classification of pumpkin seed oils regarding their chemical and sensorial properties; (b) multivariate analysis of wine samples concerning classification and determination of origin and vintage of different wine sorts; (c) characterisation and classification of mineral waters with regard to their physico-chemical properties; (d) chemometric evaluation of the pollutant contents monitored in the collected air samples and a search for correlations among the concentrations of the selected chemical components in air and some meteorological parameters; (e) handling of analytical, biochemical as well as personal characteristics of the patients undergoing morphine treatment; applying different chemometric tools, the correlations among the concentrations of morphine and its metabolites in human serum will be presented together with their dependency on daily doses, sort of medicine, personal characteristics and gender of the patient, type of tumour and traditional biochemical parameters.

The following original scientific contributions are described in this thesis: (a) determination of the quality of the selected pumpkin seed oils using measured UV-Vis, NIR and FTIR spectra at different wavelengths together with sensorial data analysis (evaluation with regard to taste, odour and colour); (b) construction of classification models of wine samples and determination of their authenticity with regard to different criteria (vintage, producer, and sort of wine) connected to the sensorial evaluation of wine samples (evaluation with regard to taste, colour and bouquet); (c) classification, chemometric characterisation and comparison of 50 different mineral water samples (37 from Slovenia and 13 from other European countries) with regard to their physical parameters and ion contents; (d) construction of mathematical models and graphical illustration of the correlations among the air pollutants and the meteorological conditions valid at the time of sampling; (e) demonstration of the correlations among the concentrations of morphine and its main metabolites in blood serum taken from oncology patients under treatment, and determination of the dependency of the above-mentioned concentrations upon the administered drug (daily dose, sort of medicament), the patient's personal characteristics and gender, tumour location and measured biochemical parameters.

Chapter IV

4. Chemometric characterisation of pumpkin seed oil samples

4.1. Abstract

The main outcome of the work was the elaboration of classification models for edible oil samples representing the most widespread brands of Austrian pumpkin seed oil. A spectral characterization of pumpkin seed oils, supplemented by a sensory evaluation of the oil quality, was performed. UV-Vis, NIR and FTIR absorption spectroscopy and fluorescence spectroscopy were used as the measurement techniques; the spectra were obtained together with the basic sensorial classification of the samples. Chemometrical processing of the measured data enabled the detection of the most important spectral features, which are crucial for the categorization of oils into two or three classes according to their sensory quality evaluated by a panel of experts. The sensory analysis of the collected oil samples served for categorizing them into either 2 basic classes (fully satisfactory, not fully satisfactory) or 3 classes (excellent, satisfactory and bad quality samples). A further sub-classification of the bad quality samples was made according to their typical odour, taste and colour. Chemometrical data processing, mainly multidimensional data analysis, enabled the detection of the most informative properties concerning the oil quality. Classification of the oil samples by sensorial quality was successfully performed by several techniques of discriminant analysis.

The chemometrical processing of the spectral data was done by principal component analysis and mainly by the linear discriminant analysis (LDA) multivariate techniques, aiming to discover the relations between the spectral and sensory sample properties. In this study, severe problems were caused by the co-linearity of the variables (spectral wavelengths, wave numbers). The key procedure for the exclusion of highly correlated variables from the original huge data matrices was based on correlation analysis of all used variables for the given type of spectrum. Only after this tedious procedure was a further variable reduction step possible, by the stepwise backward selection method in LDA. The elaborated models thus make it possible to predict the category into which a hitherto unclassified oil sample belongs, considering classification into either two categories, containing oils with overall acceptable scores or oils that were not accepted, or three categories, involving oils fulfilling all quality criteria, oils with good scores and not accepted oils. This will prospectively facilitate the determination of the chemical substances responsible for bad taste, odour and colour of the respective oil brands, as well as the finding of substances contributing to the excellent sensorial perception of some tested products, using only a small number of spectral variables.

4.2. Introduction

The quality of food is generally based on flavour, colour, and nutrition value. Analysis of raw and processed food is specific since it deals with a complex mixture of chemical compounds causing a considerable matrix effect.

Vegetable oils are important food components, essential in global nutrition. Depending on the regional conditions, a variety of oils are produced in different qualities. A standardised and optimised methodology is a necessary tool for the analysis, which includes the separation and determination of complicated food additives, preservatives, contaminants, degradation products, residues, pesticides, adulterating substances, trace elements and other markers of food quality. For this purpose, high-performance instrumental techniques have been elaborated in analytical chemistry laboratories; if necessary, even a combination of techniques can be effectively used.

Analysis of food products, both raw and processed food, is specific due to an enormous matrix effect coming from such a complex mixture of chemical compounds. Moreover, it is necessary to consider food samples as a dynamic system where the ruling equilibria are not sufficiently known and understood. Therefore, the comparison of several independent instrumental techniques and different procedures is favourable and often necessary. All of this, together with the immense capability of data production of modern analytical instrumentation, leads to a massive flow of data waiting for a meaningful interpretation. Such a task finally leads to the need for an organised data analysis approach with the use of contemporary chemometrical techniques [Massart et al. 1997, Vandenginste et al. 1998, Otto 1999]. Even a huge table (or matrix) of data can be effectively explored by chemometrical multivariate data analysis [Sharma 1996, Mellinger 1987], which reveals mutual relationships among the objects (e.g., food samples) and variables/factors (chemical, physical or sensorial properties, etc.). With the use of the electronic computer and sophisticated mathematical techniques, hidden links and correlations in the data matrix can be revealed and the most important features of the samples can be selected. A very favourable characteristic of chemometrical techniques is their very general applicability. For example, they have been effectively used for classification purposes irrespective of what kind of food was examined: milk, wine, vegetable oils, wheat and other cereals, meat and fish, fruit and vegetables. These techniques are applicable also outside the limits of food chemistry, e.g., in clinical chemistry, biochemistry and meteorology [Balla et al. 2001, Kavkova et al. 2001, Balla et al. 2002, Netriova et al. 2006, Šnuderl et al. 2007, Kiralyova et al. 2008].

Pumpkin seed oils are produced from the seeds of pumpkins (Cucurbita pepo L.). The main cultivation areas are South and East Austria and the neighbouring countries, the southern parts of North America and Central America, some regions in Africa, as well as South China. Besides their main use for the production of edible oil, pumpkin seeds have also been utilised in the pharmaceutical industry [Schiebel-Schlosser et al. 1998, Al Zhair et al. 2000]. Some of the compounds contained in pumpkin oil bring relief from disorders of the prostate gland and urinary bladder caused by benign prostatic hyperplasia (BPH). Recent studies on 2245 BPH patients showed a significant improvement of the illness symptoms after treatment with a pumpkin seed extract; such preparations are also effective in the prevention of the mentioned disease. Pumpkin seed oils enjoy special and increasing popularity, mainly due to their characteristic taste, especially in the Central European region.
In addition to the well-liked taste of many brands of pumpkin oil, several chemical compounds present in pumpkin seeds have important bioactive properties. These oils consist of approximately 70% unsaturated fatty acids; they also contain a number of hydrocarbons, triterpenoids, carotenoids, tocopherols and phytosterols [Younis et al. 2000, Murkovic et al. 1996]. The oil is not contained in the fruit, as, e.g., in the case of olive oils, but in the seeds, which consist mainly of fat (around 50%) and proteins (around 40%). The production of the oil is labour-intensive and requires a number of steps and precautions, which have an impact on its final quality. Consequently, the oil cannot be obtained by mere pressing of the cold fruit as in the case of olive oil; a denaturation of the proteins is first required. Therefore, according to the most typical procedure, the dried and ground seeds are mixed with water and traditionally heated to at least 90 ºC.

The oil cake generated in this step then yields the oil by pressing. Due to the more complicated technology and the high content of very sensitive components, the entire production cycle is critical with respect to the oil quality. In addition, the oil quality also depends on the geographical origin, seasonal variations and climatic influences [Pinelli et al. 2003]. Due to the complex character of pumpkin oil samples, several independent instrumental techniques and different analytical procedures should be used for the oil characterization [Lankmayr et al. 2004]. In our work, a comprehensive spectral characterisation of the pumpkin oil samples was performed. For this purpose, electronic spectra in the UV and visible region, near infrared spectra, Fourier transform infrared spectra and fluorescence spectra were recorded.

Edible oils belong to the commodities which are very frequent objects of falsification [Petka et al. 2001, Lankmayr et al. 2004]. It is therefore necessary to develop procedures which make possible their detailed characterization, their classification by the chosen classification criteria, and finally their authentication [Penza et al. 2004, Ferreira et al. 2007]. To perform the last task it is necessary to investigate and verify the selected samples with regard to the vegetable variety, the location of origin or confirmation of the producer, and sometimes also the year of production. Various analytical methods and chemometrical techniques have been used and described for the characterization and classification of edible oils. The very important problem of their authentication is addressed in the literature [Dennis 1998, Mannina et al. 2001, Mannina et al. 2003, Brescia et al. 2003, Brodnjak-Vončina et al. 2005]; some of the published works are focused on the spectral properties [Vigli et al. 2003, Downey et al. 2003, Christy et al. 2004, Marini et al. 2004, Rezzi et al. 2005, D Imperio et al. 2007], which proved, in general, most advantageous when investigating vegetable oils. Tied to the measured spectral data, several chemometric techniques are used in this work. A big advantage of chemometrical data processing is its universal character and abstract formalism, so that the same approach is applicable to solving various biochemical or even biological problems.

4.3. Methods

Analysed samples. One hundred and eighty-six commercially available pumpkin seed oils of Styrian origin were collected. The samples were catalogued and prepared for sensorial as well as spectroscopic characterisation. Among them, less than one half were fully characterised by all used techniques, so that they create the research basis for this work. The molecular absorption UV-Vis spectra of 70 pumpkin seed oil samples were recorded and the absorbances were measured at 226 wavelengths. The near IR (NIR) spectra were recorded and the signal was measured at 3464 wavelengths for 80 pumpkin oil samples. The FTIR spectral signal was measured at 2592 wave numbers for 82 pumpkin seed oil samples. The dominant part of the samples of Styrian origin was the same in the measurements by all above-mentioned techniques. For fluorescence measurements, which were performed later, when the originally taken samples were too old, further sets of samples were used (36 samples categorized into 2 classes and 56 samples for 3 classes), in which only 11 samples were consistent with those measured by the UV-Vis, NIR and FTIR spectra.

In addition to the spectral characterisation of the pumpkin seed oil samples, their sensory characteristics were also evaluated, so that these samples could be used as the training sets for further chemometrical data processing. The sensorial properties of both kinds of oil samples were evaluated on a ten-point scale by a panel of experts who rated the smell, taste and visual character of each sample. The presence of water was thoroughly checked, since a colloidal mixture of water and residual proteins considerably diminishes the oil quality. The sensorial quality of the collected oil samples served for categorizing them into (a) two basic classes: oils with overall acceptable scores ("good") vs. not satisfactory ("bad") oils; or (b) three classes: oils with highest-quality scores ("excellent"), medium scores ("satisfactory") and not acceptable scores ("bad"). In this way a target categorical variable was formed, which is needed for chemometrical oil classification by discriminant techniques. A further sub-classification of the bad-quality samples was made according to their typical odour, taste and colour, e.g., burnt, bitter, coarse, faint, etc.

Instrumental measurements. A computer-controlled spectrophotometer Varian Cary 50 Conc (Varian, Victoria, Australia) with a 1-cm cuvette was used for the spectrophotometric measurements. For data acquisition and processing, the software package Cary Win UV was used. Absorption spectra of the diluted (1:300, v/v) solution of pumpkin oil in isooctane (spectroscopy grade, Merck, Darmstadt, Germany) were measured in the UV-Vis region. The spectra were digitised with a step of 2 nm and saved to the hard disc of a PC, so that the absorbance values at 226 wavelengths were used as the UV-Vis spectral variables. Near-infrared spectra were recorded by an Analect Diamond-20 FT-NIR spectrometer (Hamilton Sundstrand, Pomona, CA, USA) with Analect FX90 software for data processing. The oil samples were filled directly into a 0.1-cm path-length quartz cell (QS, Hellma, Germany), which was held at 25 °C with a modified temperature control system of a density meter (DMA58, Anton Paar, Austria). The NIR spectra were measured in the region starting at 3074 nm and digitised using increments of ca. 2 nm, which corresponds to 3464 signals (NIR spectral variables) collected for each oil sample. A Fourier-transform infrared spectrometer MATTSON 3000 (Mattson Instruments, Bucks, England), interfaced to a personal computer, was used for obtaining the FTIR spectra. For data acquisition and processing, the software Win FIRST (Fourier Infrared Software Tools for Microsoft Windows from ATI Mattson) was used. To obtain the infrared spectra, one or two drops of the sample, using a 5-µl micropipette, were placed between two circular pieces of well-polished CaF2 crystal windows without using a spacer, so that the sample thickness was not exactly defined. The FTIR spectra were measured in the region between 1000 and 3498 cm⁻¹ and digitised with an increment of slightly less than 1 cm⁻¹, so that the absorption signal was known at 2592 wave numbers for each sample (FTIR spectral variables). All these measurements were made at the laboratory temperature of 22 °C. Fluorescence spectra were recorded by means of the Cary spectrophotometer equipped with the Cary Total Fluorescence Accessory.
The corresponding samples were diluted 1:300 (v/v) with isooctane (spectroscopy grade, Merck, Darmstadt, Germany).

The recorded spectra of all kinds were digitized using the selected step in wavelength or wave number and saved to the PC hard disc; the absorbances of the digitized spectra were used as the spectral variables suitable for further chemometric processing.

Chemometric methodology. Chemometrical tools were used first for a proper preparation of the data sets/subsets and then for further data processing: creation of new data matrices, matrix transposition, standardization, mean centring, and sorting according to different criteria. The original data matrices assembled in MS Excel worksheets were mostly very large: 2592×82 (FTIR) and 3464×80 (NIR), but not so large (226×70) in the case of the UV-Vis spectra. These matrices consisted of the spectral variables (wavelengths, wl, in the UV-Vis and NIR region, or wave numbers, wn, in the IR region) in rows and the oil samples in columns, because the maximum allowed number of columns in Excel, where the experimental data were exported, is 256. For data processing by the multivariate data analysis (MDA) methods, these matrices had to be transposed to have the spectral variables in columns; the SAS software package was used for this procedure, as it is capable of coping with very large matrices. Then, the samples were characterised by descriptive statistics: the maximum (max), minimum (min), mean, median (med) and standard deviation (STD) were calculated for each variable. Four methods of discriminant analysis (DA), namely linear and quadratic discriminant analysis (LDA, QDA), logistic regression (LR) and the K-th nearest neighbour method (KNN), were used for the classification of the oil samples. Success in the oil sample classification was evaluated using two ways of cross-validation: by the leave-one-out (jack-knife) method, as well as by using a special test set of samples not included in the training process by which the classification model was calculated. The classification success was given as the ratio of the number of correctly classified samples to the total number of samples, expressed also in %. The calculations were performed mainly with the commercial software packages SPSS 15.0, SAS and Statgraphics Plus.
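The thesis calculations were performed with the commercial packages named above; the following Python sketch only illustrates the leave-one-out evaluation of classification success on a hypothetical training matrix (samples in rows, spectral variables in columns):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(3)
# Hypothetical spectral data: 30 oil samples x 6 selected variables, 2 quality classes
X = np.vstack([rng.normal(0.0, 1.0, size=(15, 6)),
               rng.normal(1.5, 1.0, size=(15, 6))])
y = np.array([0] * 15 + [1] * 15)   # 0 = "good", 1 = "bad"

scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut())
# Each left-out sample is classified by a model trained on the remaining ones;
# the mean of the 0/1 scores is the classification success ratio.
print(f"classification success: {100 * scores.mean():.1f} %")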

4.4. Results and discussion

Pretreatment of data, feature reduction. The main goal of the data pre-treatment was to reduce the number of variables (wl, wn) and to prepare much smaller data matrices. This process was realised in two steps, consisting of sorting the variables and reducing the variables (called features in chemometrics). Because the preliminary MDA studies showed a severe co-linearity among the variables in all examined kinds of spectra, it was advantageous to sort the spectral variables according to a suitable criterion. If the variables were left unsorted, the co-linearity among the variables (strong inter-relations) caused just those variables to remain after the variable reduction procedure which were first on the original variable list, where they had been ordered according to the increasing value of wl or wn. Therefore, the measured variables were sorted by (a) the maximum difference (max − min) in each variable; (b) the maximum difference med_good − med_bad, found for each variable by a separate evaluation of the medians for the oil samples pre-classified as good and bad, respectively; and (c) the maximum difference mean_good − mean_bad, evaluated similarly but using the means.

The next step was correlation analysis (CA), applied to the matrix containing the sorted variables. The CA computes the Pearson correlation coefficients between all pairs of variables and was used in our case to get rid of the wavelengths/wave numbers which are highly correlated and therefore cause co-linearity problems (for illustration, a fragment of the correlation table is depicted in Figure 4-1). Several levels of the correlation coefficient value were applied for setting the cut-off value, above which all highly correlated variables are excluded from further computations. A very high cut-off value, e.g., r = … for the UV-Vis and r = 0.80 for the FTIR data, was necessary to perform a considerable variable reduction, which finally enabled us to select only the 64 most important variables, the maximum allowed in some software packages, e.g., in Statgraphics, so that the MDA results could be obtained and compared across different software. However, this approach did not result in a perfect classification within the created training sets. It is worth mentioning that neither of the maximum difference principles described above (using medians or means) provided full success in classifying the mentioned training sets.

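Continuing that sketch, a simple greedy version of the correlation-based cut-off filtering could look like the following. The thesis worked with correlation tables in the MDA software rather than this exact procedure, so treat it as an illustration only; the cut-off and the cap of 64 variables echo the values quoted in the text.

```python
# Walk the sorted variable list in order of decreasing criterion value and
# drop any variable whose Pearson |r| with an already-kept variable exceeds
# the cut-off. Continues from the sketch above (uses X_sorted).
CUTOFF = 0.80   # e.g. the FTIR value; tune per spectral region
MAX_VARS = 64   # upper limit imposed by some MDA packages

corr = X_sorted.corr().abs()  # Pearson correlations between all pairs
kept = []
for var in X_sorted.columns:
    if all(corr.loc[var, k] < CUTOFF for k in kept):
        kept.append(var)
    if len(kept) == MAX_VARS:
        break

X_reduced = X_sorted[kept]
print(f"{X_reduced.shape[1]} variables retained")
```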