Basics of Multivariate Modelling and Data Analysis


Kurt-Erik Häggblom

2. Overview of multivariate techniques
2.1 Different approaches to multivariate data analysis
2.2 Classification of multivariate techniques

2. Overview of multivariate techniques

2.1 Different approaches to multivariate data analysis

Two main approaches

Essentially, there are two different approaches:
- Model-driven approach, where data is seen as realizations of random variables of an underlying statistical model. This interpretation is usually favoured by statisticians.
- Data-driven approach, where the statistical tools are basically seen as algorithms to obtain results. This view is typical of chemometricians and data miners.

However, these are not two different ways of solving the same problem. Traditional statistical (model-driven) methods do not work well when
- the statistical properties are unknown (i.e. the assumptions are not fulfilled)
- the number of variables is large compared to the number of objects (samples, observations, measurements), sometimes even larger
- the variables are strongly correlated.

Data-driven methods are developed to handle this kind of data.

Chemometrics

Chemometrics is the science of extracting information from chemical systems (chemistry, biochemistry, chemical engineering, etc.) by data-driven means. It is
- a highly cross-disciplinary activity, using methods from applied mathematics, statistics, informatics and computer science
- applied to datasets which are often very large and highly complex, involving hundreds to tens of thousands of variables and hundreds to millions of cases or observations
- applied to solve both descriptive and predictive problems:
  - descriptive application: modelling with the intent of learning the underlying relationships and structure of the system (i.e. model identification)
  - predictive application: modelling with the intent of predicting new properties or behaviour of interest

We will mainly deal with applications of chemometrics in this course.

Data mining

Data mining is the science of extracting information from large data sets and databases. It is
- an integration of techniques from statistics, mathematics, machine learning, database technology, data visualization, pattern recognition, signal processing, information retrieval and high-performance computing
- often applied to huge data sets involving gigabytes ($2^{30} \approx 10^9$ bytes; e.g. the Human Genome Project), terabytes ($2^{40} \approx 10^{12}$ bytes; e.g. space and earth sciences), soon even petabytes ($2^{50} \approx 10^{15}$ bytes)
- applied to solve both descriptive and predictive problems:
  - descriptive data mining (or unsupervised learning): search massive data sets to discover the locations of unexpected structures or relationships, patterns, trends, clusters and outliers
  - predictive data mining (or supervised learning): build models and procedures for regression, classification, pattern recognition and machine learning tasks, and assess the predictive accuracy of those methods when applied to new data

Obviously, data mining methods are also used in chemometrics.

2.2 Classification of multivariate techniques

Data matrices

Data is assumed to be available in a data matrix $X$, where each column $x_j$, $j = 1, \dots, p$, contains $n$ observations (measurements). Each column of $X$ represents a variable and each row contains the measurements of all variables in a sample (or at a time instant). These variables are termed independent variables (although they may be highly correlated).

In addition, there may be data on a number of dependent variables in a matrix $Y$, where each column $y_k$, $k = 1, \dots, q$, contains $n$ observations. Each column of $Y$ represents a variable and each row contains the measurements of all dependent variables in a sample (or at a time instant).

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}, \qquad Y = \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1q} \\ y_{21} & y_{22} & \cdots & y_{2q} \\ \vdots & \vdots & & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{nq} \end{bmatrix}$$
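As a minimal illustration (not part of the original slides), this row/column layout maps directly onto NumPy arrays; the sizes below are arbitrary:

```python
import numpy as np

n, p, q = 50, 4, 2           # observations, independent vars, dependent vars
rng = np.random.default_rng(0)

X = rng.normal(size=(n, p))  # rows = samples, columns = independent variables
Y = rng.normal(size=(n, q))  # rows = samples, columns = dependent variables

print(X.shape)  # (50, 4): column X[:, j] holds the n observations of variable j
print(Y.shape)  # (50, 2)
```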

Main classification criterion

Our main classification criterion is the number of dependent variables. The classes are:
- no dependent variable, $q = 0$
- one dependent variable, $q = 1$
- many dependent variables, $q > 1$

Note that this classification is determined by how we choose to treat the data, not by the true (unknown) dependencies in the data set.

Modelling (like regression) is termed
- simple, if there is only one independent variable (e.g. simple regression)
- multiple, if there is one dependent variable but many independent ones
- multivariate, if there are many dependent and many independent variables.

The case with one independent variable is handled by classical univariate statistics and will not be treated here.

In addition, variables may be classified according to the type of measurement:
- metric (quantitative, ~ continuous)
- nonmetric (qualitative, categorical, discrete, often binary)

2.2.1 No dependent variable

In these methods, only the data matrix $X$ is considered. The data may be metric or nonmetric.

2.2.1.1 Principal component analysis (PCA)

PCA is a method that can be used to
- analyse interrelationships among a large number of variables
- explain these variables in terms of their common underlying components
- condense the information in a number of original variables into a smaller set of principal components with a minimal loss of information.

Mathematically, we want to find the weights $p_{jl}$ that maximize the variance of each $t_l$ in such a way that every $t_l$ is uncorrelated with every $t_m$, $m \neq l$:

$$\begin{aligned} t_1 &= x_1 p_{11} + x_2 p_{21} + \dots + x_p p_{p1} \\ t_2 &= x_1 p_{12} + x_2 p_{22} + \dots + x_p p_{p2} \\ &\;\;\vdots \\ t_a &= x_1 p_{1a} + x_2 p_{2a} + \dots + x_p p_{pa} \end{aligned}$$

Here $a$ is the number of principal components. If $\operatorname{cov}(t_l) \approx 0$, there is no useful information in $t_l$; only components that contain useful information are retained. Usually $a \ll p$ (often $a = 2$–$4$).
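As a minimal sketch (not from the slides), this is how the construction above looks with scikit-learn's PCA; the simulated data and the choice $a = 2$ are arbitrary:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 100 samples of p = 5 correlated variables (illustrative data only)
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(100, 5))

pca = PCA(n_components=2)        # keep a = 2 principal components
T = pca.fit_transform(X)         # scores t_1, t_2 (columns of T)
P = pca.components_.T            # weights p_jl (one column per component)

# the scores are uncorrelated and ordered by decreasing variance
print(np.round(np.cov(T, rowvar=False), 3))
print(pca.explained_variance_ratio_)
```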

2.2.1.2 Factor analysis (FA)

Factor analysis is similar to PCA and can be used for the same purposes. However, unlike PCA, FA is based on a statistical model with certain assumptions:

$$\begin{aligned} x_1 &= p_{11}^{\mathrm{FA}} t_1^{\mathrm{FA}} + p_{12}^{\mathrm{FA}} t_2^{\mathrm{FA}} + \dots + p_{1a}^{\mathrm{FA}} t_a^{\mathrm{FA}} + e_1 \\ x_2 &= p_{21}^{\mathrm{FA}} t_1^{\mathrm{FA}} + p_{22}^{\mathrm{FA}} t_2^{\mathrm{FA}} + \dots + p_{2a}^{\mathrm{FA}} t_a^{\mathrm{FA}} + e_2 \\ &\;\;\vdots \\ x_p &= p_{p1}^{\mathrm{FA}} t_1^{\mathrm{FA}} + p_{p2}^{\mathrm{FA}} t_2^{\mathrm{FA}} + \dots + p_{pa}^{\mathrm{FA}} t_a^{\mathrm{FA}} + e_p \end{aligned}$$

Mathematically, we want to find the weights $p_{jl}^{\mathrm{FA}}$ and the factors $t_l^{\mathrm{FA}}$ so that the error variance behaves in a certain way.

Example: consumer rating. Assume customers in a fast-food restaurant are asked to rate the restaurant on the following six variables: food taste, food temperature, food freshness, waiting time, cleanliness, and friendliness of employees. Analysis of the customer responses by factor analysis may show that the variables food taste, temperature and freshness combine to form a single factor "food quality", whereas waiting time, cleanliness and friendliness form a factor "service quality".
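A hedged sketch of this model, assuming scikit-learn's FactorAnalysis as one concrete estimator; the restaurant-style data below is simulated for illustration, not the slide's actual survey:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
# simulate 200 respondents driven by two latent factors
# ("food quality", "service quality"), three observed variables each
factors = rng.normal(size=(200, 2))
loadings = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.0],   # taste, temperature, freshness
                     [0.0, 1.0], [0.1, 0.9], [0.0, 0.8]])  # waiting, cleanliness, friendliness
X = factors @ loadings.T + 0.3 * rng.normal(size=(200, 6))

fa = FactorAnalysis(n_components=2)     # fit the two-factor model
fa.fit(X)
print(np.round(fa.components_, 2))      # estimated loadings p_jl^FA
print(np.round(fa.noise_variance_, 2))  # estimated error variances var(e_j)
```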

2.2.1.3 Cluster analysis

Cluster analysis is an analytical technique for developing meaningful subgroups of individuals or objects. The subgroups are not predefined; they are identified by the analysis.
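As one hedged illustration (k-means is only one of many clustering algorithms, and this example is not from the slides), the analysis can rediscover subgroups that were never labelled:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# two artificial subgroups that the analysis should rediscover
X = np.vstack([rng.normal(loc=0.0, size=(50, 2)),
               rng.normal(loc=4.0, size=(50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5], km.labels_[-5:])   # cluster membership per object
print(np.round(km.cluster_centers_, 2))  # subgroup centroids
```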

2.2.2 One dependent variable

In these methods, a vector of dependent variables, $Y = y$, is considered in addition to the data matrix $X$. The data may be metric or nonmetric.

2.2.2.1 Multiple regression analysis (MRA)

MRA may be used to
- relate a single metric dependent variable to a number of independent variables
- predict changes in the dependent variable in response to changes in the independent variables.

Mathematically, we want to find the parameters $b_j$, $j = 0, \dots, p$, that maximize the correlation between $y$ and the prediction $\hat{y}$:

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p + e$$
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$$

This is equivalent to minimizing the variance of the error $e$, or the sum of the squared residuals.

Note: this method does not work well if the independent variables are (strongly) correlated.
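A minimal ordinary-least-squares sketch with scikit-learn (synthetic data; the true coefficients $b_j$ are recovered from simulated observations):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))                      # three independent variables
b_true = np.array([2.0, -1.0, 0.5])
y = 1.0 + X @ b_true + 0.1 * rng.normal(size=100)  # y = b0 + b1 x1 + ... + e

mra = LinearRegression().fit(X, y)
print(round(mra.intercept_, 2), np.round(mra.coef_, 2))  # estimates of b0 and b_j
print(round(mra.score(X, y), 3))  # R^2: squared correlation between y and y-hat
```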

2.2.2.2 Multiple discriminant analysis (MDA)

MDA is the appropriate multivariate technique if the single dependent variable $y$ is nonmetric, either
- dichotomous (e.g. male vs. female), or
- multichotomous (e.g. high vs. medium vs. low).

The independent variables $x_j$ are assumed to be metric. Thus, discriminant analysis is applicable when the total sample can be divided into groups based on a nonmetric dependent variable characterizing several known classes.

The primary objectives of MDA are to
- understand group differences
- predict the likelihood that an entity (individual or object) will belong to a particular class or group based on several metric independent variables.
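A hedged sketch using scikit-learn's LinearDiscriminantAnalysis, one standard implementation of discriminant analysis; the two groups below are simulated:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(4)
# two known groups described by metric independent variables
X = np.vstack([rng.normal(loc=0.0, size=(60, 2)),
               rng.normal(loc=2.0, size=(60, 2))])
y = np.array([0] * 60 + [1] * 60)   # nonmetric (dichotomous) dependent variable

mda = LinearDiscriminantAnalysis().fit(X, y)
print(np.round(mda.predict_proba(X[:3]), 3))  # likelihood of group membership
print(mda.predict([[1.0, 1.0]]))              # predicted group for a new entity
```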

2.2.2.3 Logistic regression

Logistic regression models, often referred to as logit analysis, are a combination of
- multiple regression analysis (MRA): many independent variables
- multiple discriminant analysis (MDA): nonmetric dependent variable.

The difference from these methods is that logistic regression models
- allow the independent variables to be metric or nonmetric
- do not require the assumption of multivariate normality.

In many cases, particularly with more than two levels of the dependent variable, MDA is the more appropriate technique.
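A minimal scikit-learn sketch of a binary logit model (illustrative simulated data, not from the slides):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
# binary outcome whose log-odds are linear in x1 and x2
p = 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - 1.0 * X[:, 1])))
y = rng.binomial(1, p)

logit = LogisticRegression().fit(X, y)
print(np.round(logit.coef_, 2))                 # estimated log-odds coefficients
print(np.round(logit.predict_proba(X[:3]), 3))  # class probabilities
```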

2.2.3 Many dependent variables

In these methods, a matrix of dependent variables, $Y$, is considered in addition to the data matrix $X$. The data may be metric or nonmetric.

2.2.3.1 Canonical correlation analysis (CCA)

Canonical correlation analysis can be viewed as a logical extension of multiple regression analysis (a single metric dependent and several metric independent variables). With CCA the objective is to correlate simultaneously several metric dependent and several metric independent variables.

The underlying principle is to
- develop a linear combination of each set of variables,
$$t = p_1 x_1 + p_2 x_2 + \dots + p_p x_p, \qquad u = q_1 y_1 + q_2 y_2 + \dots + q_q y_q$$
- maximize the correlation
$$\operatorname{cor}(t, u) = \frac{\operatorname{cov}(t, u)}{\operatorname{std}(t)\,\operatorname{std}(u)}$$
with respect to the parameters $p_j$, $j = 1, \dots, p$, and $q_k$, $k = 1, \dots, q$.
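As a hedged sketch with scikit-learn's CCA, where the simulated X and Y blocks share one latent direction that the canonical variates should pick up:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(6)
shared = rng.normal(size=(100, 1))                  # common latent variable
X = np.hstack([shared, rng.normal(size=(100, 2))])  # independent block
Y = np.hstack([shared + 0.1 * rng.normal(size=(100, 1)),
               rng.normal(size=(100, 1))])          # dependent block

cca = CCA(n_components=1).fit(X, Y)
t, u = cca.transform(X, Y)                          # canonical variates t and u
print(round(np.corrcoef(t[:, 0], u[:, 0])[0, 1], 3))  # maximized cor(t, u)
```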

2.2.3.2 Partial least squares (PLS)

PLS stands for "partial least squares" or "projection to latent structures". It is a method for relating a matrix $X$ to a vector $y$ or a matrix $Y$. Similarly to CCA, linear combinations of the X and Y data are formed:

$$t_l = x_1 p_{1l} + x_2 p_{2l} + \dots + x_p p_{pl}, \quad l = 1, \dots, a$$
$$u_m = y_1 q_{1m} + y_2 q_{2m} + \dots + y_q q_{qm}, \quad m = 1, \dots, b$$

In PLS, the weights $p_{jl}$, $j = 1, \dots, p$, and $q_{km}$, $k = 1, \dots, q$, are (usually) determined so that the covariances between $t_l$ and $u_m$ are maximized for all $l, m$. This combines
- high variance of $t_l$ and $u_m$ (i.e. high information content)
- high correlation between $t_l$ and $u_m$ (good for predictive modelling).

In addition, linear relationships between $t_l$ and $u_m$ are determined by ordinary least squares (OLS).

There is a similar method, principal component regression (PCR), where the weights $p_{jl}$, $j = 1, \dots, p$, are determined by principal component analysis (PCA) of $X$.
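A hedged sketch with scikit-learn's PLSRegression, fitting two latent components to simulated data:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 6))
# Y depends on two directions of X (illustrative construction)
Y = np.hstack([X[:, :3].sum(axis=1, keepdims=True),
               X[:, 3:].sum(axis=1, keepdims=True)]) + 0.1 * rng.normal(size=(100, 2))

pls = PLSRegression(n_components=2).fit(X, Y)
T, U = pls.transform(X, Y)          # X-scores t_l and Y-scores u_m
print(np.round(np.diag(np.corrcoef(T.T, U.T)[:2, 2:]), 3))  # cor(t_l, u_l)
print(pls.predict(X[:2]))           # OLS-based prediction of Y from the scores
```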

2.2.3.3 Independent component analysis (ICA)

In general, ICA tries to reveal the independent factors (variables/signals) behind a set of mixed random variables or measurement signals. ICA is typically restricted to linear mixtures, and the underlying sources are assumed to be mutually independent. This is a stronger assumption than mere uncorrelatedness (which only concerns the first two moments of the probability distributions)!

It is assumed that the observed random variables $x_1, \dots, x_p$, collected in the observation matrix $X$, are the result of a linear combination of $m$ underlying sources $s_1, \dots, s_m$, collected in the source matrix $S$. The following model, called a noise-free ICA model, can then be assumed:

$$X = A S^{\mathrm{T}}$$

Here $A$ is a full-rank $n \times m$ matrix, where $n$ is the number of observations. Both $A$ and $S$ are unknown, but if the sources are statistically independent, it is possible to find $A$. The sources are then recovered according to

$$S^{\mathrm{T}} = A^{-1} X$$
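As a hedged closing sketch, scikit-learn's FastICA (one specific ICA algorithm, not the only one) recovers such sources from a simulated linear mixture; the two signals below are a classic illustrative choice:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(8)
time = np.linspace(0, 8, 2000)
# two statistically independent (non-Gaussian) sources
S = np.column_stack([np.sin(2 * time), np.sign(np.cos(3 * time))])
A = np.array([[1.0, 0.5], [0.4, 1.0]])    # unknown mixing matrix
X = S @ A.T                               # observed linear mixtures

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)              # recovered sources (up to order and scale)
print(np.round(ica.mixing_, 2))           # estimated mixing matrix
```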