Basics of Multivariate Modelling and Data Analysis


Kurt-Erik Häggblom

2. Overview of multivariate techniques
2.1 Different approaches to multivariate data analysis
2.2 Classification of multivariate techniques

2. Overview of multivariate techniques

2.1 Different approaches to multivariate data analysis

Two main approaches

Essentially, there are two different approaches:
- Model-driven approach, where data is seen as realizations of random variables of an underlying statistical model. This interpretation is usually favoured by statisticians.
- Data-driven approach, where the statistical tools are basically seen as algorithms to obtain results. This view is typical of chemometricians and data miners.

However, these are not two different ways of solving the same problem. Traditional statistical (model-driven) methods do not work well when
- the statistical properties are unknown (i.e. the assumptions are not fulfilled)
- the number of variables is large compared to the number of objects (samples, observations, measurements), sometimes even larger
- the variables are strongly correlated.

Data-driven methods are developed to handle this kind of data.

Chemometrics

Chemometrics is the science of extracting information from chemical systems (chemistry, biochemistry, chemical engineering, etc.) by data-driven means. It is
- a highly cross-disciplinary activity, using methods from applied mathematics, statistics, informatics and computer science
- applied to datasets which are often very large and highly complex, involving hundreds to tens of thousands of variables and hundreds to millions of cases or observations
- applied to solve both descriptive and predictive problems:
  - descriptive application: modelling with the intent of learning the underlying relationships and structure of the system (i.e. model identification)
  - predictive application: modelling with the intent of predicting new properties or behaviour of interest

We will mainly deal with applications of chemometrics in this course.

Data mining

Data mining is the science of extracting information from large data sets and databases. It is
- an integration of techniques from statistics, mathematics, machine learning, database technology, data visualization, pattern recognition, signal processing, information retrieval and high-performance computing
- often applied to huge data sets involving gigabytes ($2^{30} \approx 10^9$ bytes; e.g. the Human Genome Project), terabytes ($2^{40} \approx 10^{12}$ bytes; e.g. space and earth sciences), soon even petabytes ($2^{50} \approx 10^{15}$ bytes)
- applied to solve both descriptive and predictive problems:
  - descriptive data mining (or unsupervised learning): search massive data sets to discover the locations of unexpected structures or relationships, patterns, trends, clusters and outliers
  - predictive data mining (or supervised learning): build models and procedures for regression, classification, pattern recognition and machine learning tasks, and assess the predictive accuracy of those methods when applied to new data

Obviously, data mining methods are also used in chemometrics.

2.2 Classification of multivariate techniques

Data matrices

Data is assumed to be available in a data matrix $X$, where each column $x_j$, $j = 1, \dots, p$, contains $n$ observations (measurements). Each column of $X$ represents a variable and each row contains the measurements of all variables in a sample (or at a time instant). These variables are termed independent variables (although they may be highly correlated).

In addition, there may be data on a number of dependent variables in a matrix $Y$, where each column $y_k$, $k = 1, \dots, q$, contains $n$ observations. Each column of $Y$ represents a variable and each row contains the measurements of all dependent variables in a sample (or at a time instant).

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}, \qquad Y = \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1q} \\ y_{21} & y_{22} & \cdots & y_{2q} \\ \vdots & \vdots & & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{nq} \end{bmatrix}$$
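As a minimal illustration (not part of the original slides), this row/column layout maps directly onto NumPy arrays; the sizes below are arbitrary:

```python
import numpy as np

n, p, q = 50, 4, 2           # observations, independent vars, dependent vars
rng = np.random.default_rng(0)

X = rng.normal(size=(n, p))  # rows = samples, columns = independent variables
Y = rng.normal(size=(n, q))  # rows = samples, columns = dependent variables

print(X.shape)  # (50, 4): column X[:, j] holds the n observations of variable j
print(Y.shape)  # (50, 2)
```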

Main classification criterion

Our main classification criterion is the number of dependent variables. The classes are:
- no dependent variable, $q = 0$
- one dependent variable, $q = 1$
- many dependent variables, $q > 1$

Note that this classification is determined by how we choose to treat the data, not by the true (unknown) dependencies in the data set.

Modelling (like regression) is termed
- simple, if there is only one independent variable (e.g. simple regression)
- multiple, if there is one dependent variable but many independent ones
- multivariate, if there are many dependent and many independent variables.

The case with one independent variable is handled by classical univariate statistics and will not be treated here.

In addition, variables may be classified according to the type of measurement:
- metric (quantitative, ~ continuous)
- nonmetric (qualitative, categorical, discrete, often binary)

2.2.1 No dependent variable

In these methods, only the data matrix $X$ is considered. The data may be metric or nonmetric.

2.2.1.1 Principal component analysis (PCA)

PCA is a method that can be used to
- analyse interrelationships among a large number of variables
- explain these variables in terms of their common underlying components
- condense the information in a number of original variables into a smaller set of principal components with a minimal loss of information.

Mathematically, we want to find the weights $p_{jl}$ that maximize the variance of each $t_l$ in such a way that every $t_l$ is uncorrelated with every $t_m$, $m \neq l$:

$$\begin{aligned} t_1 &= x_1 p_{11} + x_2 p_{21} + \dots + x_p p_{p1} \\ t_2 &= x_1 p_{12} + x_2 p_{22} + \dots + x_p p_{p2} \\ &\;\;\vdots \\ t_a &= x_1 p_{1a} + x_2 p_{2a} + \dots + x_p p_{pa} \end{aligned}$$

Here $a$ is the number of principal components. If $\operatorname{cov}(t_l) \approx 0$, there is no useful information in $t_l$; only components that contain useful information are retained. Usually $a \ll p$ (often $a = 2$–$4$).
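As a minimal sketch (not from the slides), this is how the construction above looks with scikit-learn's PCA; the simulated data and the choice $a = 2$ are arbitrary:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 100 samples of p = 5 correlated variables (illustrative data only)
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(100, 5))

pca = PCA(n_components=2)        # keep a = 2 principal components
T = pca.fit_transform(X)         # scores t_1, t_2 (columns of T)
P = pca.components_.T            # weights p_jl (one column per component)

# the scores are uncorrelated and ordered by decreasing variance
print(np.round(np.cov(T, rowvar=False), 3))
print(pca.explained_variance_ratio_)
```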

2.2.1.2 Factor analysis (FA)

Factor analysis is similar to PCA and can be used for the same purposes. However, unlike PCA, FA is based on a statistical model with certain assumptions:

$$\begin{aligned} x_1 &= p_{11}^{\mathrm{FA}} t_1^{\mathrm{FA}} + p_{12}^{\mathrm{FA}} t_2^{\mathrm{FA}} + \dots + p_{1a}^{\mathrm{FA}} t_a^{\mathrm{FA}} + e_1 \\ x_2 &= p_{21}^{\mathrm{FA}} t_1^{\mathrm{FA}} + p_{22}^{\mathrm{FA}} t_2^{\mathrm{FA}} + \dots + p_{2a}^{\mathrm{FA}} t_a^{\mathrm{FA}} + e_2 \\ &\;\;\vdots \\ x_p &= p_{p1}^{\mathrm{FA}} t_1^{\mathrm{FA}} + p_{p2}^{\mathrm{FA}} t_2^{\mathrm{FA}} + \dots + p_{pa}^{\mathrm{FA}} t_a^{\mathrm{FA}} + e_p \end{aligned}$$

Mathematically, we want to find the weights $p_{jl}^{\mathrm{FA}}$ and the factors $t_l^{\mathrm{FA}}$ so that the error variance behaves in a certain way.

Example: consumer rating. Assume customers in a fast-food restaurant are asked to rate the restaurant on the following six variables: food taste, food temperature, food freshness, waiting time, cleanliness, and friendliness of employees. Analysis of the customer responses by factor analysis may show that the variables food taste, temperature and freshness combine to form a single factor "food quality", whereas waiting time, cleanliness and friendliness form a factor "service quality".
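A hedged sketch of this model, assuming scikit-learn's FactorAnalysis as one concrete estimator; the restaurant-style data below is simulated for illustration, not the slide's actual survey:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
# simulate 200 respondents driven by two latent factors
# ("food quality", "service quality"), three observed variables each
factors = rng.normal(size=(200, 2))
loadings = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.0],   # taste, temperature, freshness
                     [0.0, 1.0], [0.1, 0.9], [0.0, 0.8]])  # waiting, cleanliness, friendliness
X = factors @ loadings.T + 0.3 * rng.normal(size=(200, 6))

fa = FactorAnalysis(n_components=2)     # fit the two-factor model
fa.fit(X)
print(np.round(fa.components_, 2))      # estimated loadings p_jl^FA
print(np.round(fa.noise_variance_, 2))  # estimated error variances var(e_j)
```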

2.2.1.3 Cluster analysis

Cluster analysis is an analytical technique for developing meaningful subgroups of individuals or objects. The subgroups are not predefined; they are identified by the analysis.
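As one hedged illustration (k-means is only one of many clustering algorithms, and this example is not from the slides), the analysis can rediscover subgroups that were never labelled:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# two artificial subgroups that the analysis should rediscover
X = np.vstack([rng.normal(loc=0.0, size=(50, 2)),
               rng.normal(loc=4.0, size=(50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5], km.labels_[-5:])   # cluster membership per object
print(np.round(km.cluster_centers_, 2))  # subgroup centroids
```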

2.2.2 One dependent variable

In these methods, a vector of dependent variables, $Y = y$, is considered in addition to the data matrix $X$. The data may be metric or nonmetric.

2.2.2.1 Multiple regression analysis (MRA)

MRA may be used to
- relate a single metric dependent variable to a number of independent variables
- predict changes in the dependent variable in response to changes in the independent variables.

Mathematically, we want to find the parameters $b_j$, $j = 0, \dots, p$, that maximize the correlation between $y$ and the prediction $\hat{y}$:

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p + e$$
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$$

This is equivalent to minimizing the variance of the error $e$, or the sum of the squared residuals.

Note: this method does not work well if the independent variables are (strongly) correlated.
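A minimal ordinary-least-squares sketch with scikit-learn (synthetic data; the true coefficients $b_j$ are recovered from simulated observations):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))                      # three independent variables
b_true = np.array([2.0, -1.0, 0.5])
y = 1.0 + X @ b_true + 0.1 * rng.normal(size=100)  # y = b0 + b1 x1 + ... + e

mra = LinearRegression().fit(X, y)
print(round(mra.intercept_, 2), np.round(mra.coef_, 2))  # estimates of b0 and b_j
print(round(mra.score(X, y), 3))  # R^2: squared correlation between y and y-hat
```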

2.2.2.2 Multiple discriminant analysis (MDA)

MDA is the appropriate multivariate technique if the single dependent variable $y$ is nonmetric, either
- dichotomous (e.g. male vs. female), or
- multichotomous (e.g. high vs. medium vs. low).

The independent variables $x_j$ are assumed to be metric. Thus, discriminant analysis is applicable when the total sample can be divided into groups based on a nonmetric dependent variable characterizing several known classes.

The primary objectives of MDA are to
- understand group differences
- predict the likelihood that an entity (individual or object) will belong to a particular class or group based on several metric independent variables.
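A hedged sketch using scikit-learn's LinearDiscriminantAnalysis, one standard implementation of discriminant analysis; the two groups below are simulated:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(4)
# two known groups described by metric independent variables
X = np.vstack([rng.normal(loc=0.0, size=(60, 2)),
               rng.normal(loc=2.0, size=(60, 2))])
y = np.array([0] * 60 + [1] * 60)   # nonmetric (dichotomous) dependent variable

mda = LinearDiscriminantAnalysis().fit(X, y)
print(np.round(mda.predict_proba(X[:3]), 3))  # likelihood of group membership
print(mda.predict([[1.0, 1.0]]))              # predicted group for a new entity
```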

2.2.2.3 Logistic regression

Logistic regression models, often referred to as logit analysis, are a combination of
- multiple regression analysis (MRA): many independent variables
- multiple discriminant analysis (MDA): nonmetric dependent variable.

The difference from these methods is that logistic regression models
- allow the independent variables to be metric or nonmetric
- do not require the assumption of multivariate normality.

In many cases, particularly with more than two levels of the dependent variable, MDA is the more appropriate technique.
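A minimal scikit-learn sketch of a binary logit model (illustrative simulated data, not from the slides):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
# binary outcome whose log-odds are linear in x1 and x2
p = 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - 1.0 * X[:, 1])))
y = rng.binomial(1, p)

logit = LogisticRegression().fit(X, y)
print(np.round(logit.coef_, 2))                 # estimated log-odds coefficients
print(np.round(logit.predict_proba(X[:3]), 3))  # class probabilities
```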

2.2.3 Many dependent variables

In these methods, a matrix of dependent variables, $Y$, is considered in addition to the data matrix $X$. The data may be metric or nonmetric.

2.2.3.1 Canonical correlation analysis (CCA)

Canonical correlation analysis can be viewed as a logical extension of multiple regression analysis (a single metric dependent and several metric independent variables). With CCA the objective is to correlate simultaneously several metric dependent and several metric independent variables.

The underlying principle is to
- develop a linear combination of each set of variables,
$$t = p_1 x_1 + p_2 x_2 + \dots + p_p x_p, \qquad u = q_1 y_1 + q_2 y_2 + \dots + q_q y_q$$
- maximize the correlation
$$\operatorname{cor}(t, u) = \frac{\operatorname{cov}(t, u)}{\operatorname{std}(t)\,\operatorname{std}(u)}$$
with respect to the parameters $p_j$, $j = 1, \dots, p$, and $q_k$, $k = 1, \dots, q$.
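As a hedged sketch with scikit-learn's CCA, where the simulated X and Y blocks share one latent direction that the canonical variates should pick up:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(6)
shared = rng.normal(size=(100, 1))                  # common latent variable
X = np.hstack([shared, rng.normal(size=(100, 2))])  # independent block
Y = np.hstack([shared + 0.1 * rng.normal(size=(100, 1)),
               rng.normal(size=(100, 1))])          # dependent block

cca = CCA(n_components=1).fit(X, Y)
t, u = cca.transform(X, Y)                          # canonical variates t and u
print(round(np.corrcoef(t[:, 0], u[:, 0])[0, 1], 3))  # maximized cor(t, u)
```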

2.2.3.2 Partial least squares (PLS)

PLS stands for "partial least squares" or "projection to latent structures". It is a method for relating a matrix $X$ to a vector $y$ or a matrix $Y$. Similarly to CCA, linear combinations of the X and Y data are formed:

$$t_l = x_1 p_{1l} + x_2 p_{2l} + \dots + x_p p_{pl}, \quad l = 1, \dots, a$$
$$u_m = y_1 q_{1m} + y_2 q_{2m} + \dots + y_q q_{qm}, \quad m = 1, \dots, b$$

In PLS, the weights $p_{jl}$, $j = 1, \dots, p$, and $q_{km}$, $k = 1, \dots, q$, are (usually) determined so that the covariances between $t_l$ and $u_m$ are maximized for all $l, m$. This combines
- high variance of $t_l$ and $u_m$ (i.e. high information content)
- high correlation between $t_l$ and $u_m$ (good for predictive modelling).

In addition, linear relationships between $t_l$ and $u_m$ are determined by ordinary least squares (OLS).

There is a similar method, principal component regression (PCR), where the weights $p_{jl}$, $j = 1, \dots, p$, are determined by principal component analysis (PCA) of $X$.
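A hedged sketch with scikit-learn's PLSRegression, fitting two latent components to simulated data:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 6))
# Y depends on two directions of X (illustrative construction)
Y = np.hstack([X[:, :3].sum(axis=1, keepdims=True),
               X[:, 3:].sum(axis=1, keepdims=True)]) + 0.1 * rng.normal(size=(100, 2))

pls = PLSRegression(n_components=2).fit(X, Y)
T, U = pls.transform(X, Y)          # X-scores t_l and Y-scores u_m
print(np.round(np.diag(np.corrcoef(T.T, U.T)[:2, 2:]), 3))  # cor(t_l, u_l)
print(pls.predict(X[:2]))           # OLS-based prediction of Y from the scores
```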

2.2.3.3 Independent component analysis (ICA)

In general, ICA tries to reveal the independent factors (variables/signals) behind a set of mixed random variables or measurement signals. ICA is typically restricted to linear mixtures, and the underlying sources are assumed to be mutually independent. This is a stronger assumption than mere uncorrelatedness (which only concerns the first two moments of the probability distributions)!

It is assumed that the observed random variables $x_1, \dots, x_p$, collected in the observation matrix $X$, are the result of a linear combination of $m$ underlying sources $s_1, \dots, s_m$, collected in the source matrix $S$. The following model, called a noise-free ICA model, can then be assumed:

$$X = A S^{\mathrm{T}}$$

Here $A$ is a full-rank $n \times m$ matrix, where $n$ is the number of observations. Both $A$ and $S$ are unknown, but if the sources are statistically independent, it is possible to find $A$. The sources are then recovered according to

$$S^{\mathrm{T}} = A^{-1} X$$
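As a hedged closing sketch, scikit-learn's FastICA (one specific ICA algorithm, not the only one) recovers such sources from a simulated linear mixture; the two signals below are a classic illustrative choice:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(8)
time = np.linspace(0, 8, 2000)
# two statistically independent (non-Gaussian) sources
S = np.column_stack([np.sin(2 * time), np.sign(np.cos(3 * time))])
A = np.array([[1.0, 0.5], [0.4, 1.0]])    # unknown mixing matrix
X = S @ A.T                               # observed linear mixtures

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)              # recovered sources (up to order and scale)
print(np.round(ica.mixing_, 2))           # estimated mixing matrix
```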