Multivariate Analysis


Prof. Dr. J. Franke, All of Statistics 3.1

Multivariate Analysis

High dimensional data: $X_1, \ldots, X_N$ i.i.d. random vectors in $\mathbb{R}^p$. As a data matrix $\mathbf{X}$ (rows = objects, columns = values of the $p$ features):
$$\mathbf{X} = \begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & & \vdots \\ X_{N1} & X_{N2} & \cdots & X_{Np} \end{pmatrix}, \qquad X_j^T = (X_{j1}, \ldots, X_{jp}) = j\text{-th row of } \mathbf{X}.$$

Many statistical problems and procedures (estimators, tests, ...) are similar to dimension 1, but some are specific to multivariate data, e.g. dimension reduction, finding a few relevant features, ...

Prof. Dr. J. Franke, All of Statistics 3.2

Ideally: $X_1, \ldots, X_N$ i.i.d. multivariate normal $N_p(\mu, \Sigma)$ with mean vector $\mu$ and covariance matrix $\Sigma$:
$$\mu_k = E X_{jk}, \qquad \Sigma_{kl} = \mathrm{cov}(X_{jk}, X_{jl}), \quad k, l = 1, \ldots, p.$$

Parameter estimates:
$$\hat\mu = \bar X_N = \frac{1}{N} \sum_{j=1}^N X_j, \qquad \hat\Sigma = S = \frac{1}{N} \sum_{j=1}^N (X_j - \bar X_N)(X_j - \bar X_N)^T,$$
i.e. $S_{kl} = \frac{1}{N} \sum_{j=1}^N (X_{jk} - \hat\mu_k)(X_{jl} - \hat\mu_l)$.

$S$ is symmetric, all eigenvalues $\geq 0$. Let $d_1 \geq d_2 \geq \ldots \geq d_p \geq 0$ be the eigenvalues of $S$:
$$S = O D O^T, \qquad D = \mathrm{diag}(d_1, \ldots, d_p).$$
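For illustration, the estimates and the eigendecomposition above can be computed directly; the following is a minimal numpy sketch in which the data matrix `X` (here just simulated) and all variable names are illustrative, not part of the notes.

```python
import numpy as np

# Illustrative data only: N = 180 observations of p = 13 features,
# standing in for a real data matrix X (rows = objects, columns = features).
rng = np.random.default_rng(0)
X = rng.normal(size=(180, 13))

N, p = X.shape
mu_hat = X.mean(axis=0)              # sample mean  \hat{mu}
Xc = X - mu_hat                      # centred data
S = (Xc.T @ Xc) / N                  # covariance estimate with 1/N, as on the slide

# Eigendecomposition S = O D O^T with eigenvalues sorted decreasingly
d, O = np.linalg.eigh(S)             # eigh: S is symmetric
order = np.argsort(d)[::-1]
d, O = d[order], O[:, order]         # d_1 >= d_2 >= ... >= d_p
```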

Prof. Dr. J. Franke, All of Statistics 3.3

$$S = O D O^T, \qquad D = \mathrm{diag}(d_1, \ldots, d_p),$$
$O$ orthogonal matrix (basis transformation), columns of $O$ = orthogonal basis of eigenvectors of $S$.

Analogously: $\delta_1 \geq \delta_2 \geq \ldots \geq \delta_p \geq 0$ eigenvalues of $\Sigma$,
$$\Delta = \mathrm{diag}(\delta_1, \ldots, \delta_p) = \Omega^T \Sigma\, \Omega, \qquad \Sigma = \Omega\, \Delta\, \Omega^T.$$

Principal Component Analysis (PCA)

$X_0$ representative of $X_1, \ldots, X_N$; principal component transformation:
$$X_0 \mapsto W = \Omega^T (X_0 - \mu).$$
$k$-th principal component of $X_0$:
$$W_k = \langle X_0 - \mu, e_k \rangle,$$
where $e_1, \ldots, e_p$ = normed eigenvectors of $\Sigma$ = columns of $\Omega$.

Prof. Dr. J. Franke, All of Statistics 3.4

$W = \sum_{k=1}^p W_k e_k$ is the representation of $X_0 - \mu$ in the eigenbasis of $\Sigma$.

Properties:
$$E W_k = 0, \qquad \mathrm{var}\, W_k = \delta_k, \qquad \mathrm{cov}(W_k, W_l) = 0, \ k \neq l,$$
$$\mathrm{var}\, W_1 \geq \mathrm{var}\, W_2 \geq \ldots \geq \mathrm{var}\, W_p.$$

If $X_0$ is $N_p(\mu, \Sigma)$-distributed, then $W$ is $N_p(0, \Delta)$-distributed, and, hence, $W_1, \ldots, W_p$ are independent!

$$\mathrm{var}\, W_1 = \max\Big\{ \mathrm{var}\, U;\ U = \sum_{k=1}^p \alpha_k X_{0k},\ \alpha_1, \ldots, \alpha_p \in \mathbb{R},\ \sum_{k=1}^p \alpha_k^2 = 1 \Big\}$$

Idea of principal component analysis (dimension reduction): find $q \ll p$ linear combinations of $X_{01}, \ldots, X_{0p}$ which explain a large percentage of the variability in the features.

Prof. Dr. J. Franke, All of Statistics 3.5

Solution: $W_1, \ldots, W_q$. Open questions: $\mu, \Sigma = ?$, $q = ?$

Empirical principal components

Sample $X_1, \ldots, X_N$ i.i.d., sample covariance matrix $S = O D O^T$, sample mean $\hat\mu = \bar X_N$; principal component transformation:
$$X_j \mapsto V_j = O^T (X_j - \bar X_N).$$
Use the features $V_{j1}, \ldots, V_{jq}$ instead of $X_{j1}, \ldots, X_{jp}$, $j = 1, \ldots, N$.

Selection of $q$: $d_1 \geq d_2 \geq \ldots \geq d_p$ eigenvalues of $S$. The proportion of the variability of $X_0$ explained by $W_1, \ldots, W_q$ is $(\delta_1 + \ldots + \delta_q)/(\delta_1 + \ldots + \delta_p)$, estimated by $(d_1 + \ldots + d_q)/(d_1 + \ldots + d_p)$.
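Continuing the earlier sketch (it reuses the illustrative `X`, `mu_hat`, `O` and `d` defined there), the empirical principal components and the estimated explained proportion could be obtained as follows.

```python
# Empirical principal components V_j = O^T (X_j - \bar{X}_N), all rows at once
V = (X - mu_hat) @ O                 # shape (N, p); column k holds V_{jk}, j = 1..N

# Estimated proportion of variability explained by the first q components
q = 4                                # illustrative choice of q
explained = d[:q].sum() / d.sum()
print(f"first {q} components explain {explained:.1%} of the total variability")
```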

Prof. Dr. J. Franke, All of Statistics 3.6

Scree graph: plot $d_k$ or $d_k / \sum_{l=1}^p d_l$ against $k$; choose that $q$ where the graph becomes flat.

Rules of thumb:

a) Choose $q$ such that 90% of the total variability is explained:
$$q = \min\Big\{ r \leq p;\ d_1 + \ldots + d_r \geq 0.9 \sum_{k=1}^p d_k = 0.9\, \mathrm{tr}\, S \Big\}$$

b) (Kaiser) Consider only principal components with above-average variance:
$$q = \max\Big\{ r \leq p;\ d_r \geq \frac{1}{p} \sum_{k=1}^p d_k \Big\}$$
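Both rules of thumb are easy to apply to the sorted eigenvalues; a minimal sketch (function names are illustrative):

```python
import numpy as np

def choose_q_90(d):
    """Rule a): smallest r with d_1 + ... + d_r >= 0.9 * tr(S); d sorted decreasingly."""
    cum = np.cumsum(d)
    return int(np.searchsorted(cum, 0.9 * d.sum())) + 1

def choose_q_kaiser(d):
    """Rule b) (Kaiser): largest r with d_r >= average eigenvalue."""
    return int(np.sum(d >= d.mean()))

# e.g. with the sorted eigenvalues d from the sketch above:
# choose_q_90(d), choose_q_kaiser(d)
```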

Prof. Dr. J. Franke, All of Statistics 3.7

$N = 180$ pit props cut from Corsican pine (Jeffreys, 1967).

Goal (regression): $Y_j$ = maximum compressive strength as a function of $p = 13$ predictor variables:

1: top diameter
2: length
3: moisture content (% of dry weight)
4: specific gravity at test
5: oven-dry specific gravity of timber
6: no. of annual rings at top
7: no. of annual rings at base
8: maximum bow
9: distance top to point of maximum bow
10: no. of knot whorls
11: length of clear prop from top
12: average no. of knots per whorl
13: average diameter of knots

Principal component analysis for $X_{j1}, \ldots, X_{jp}$, $j = 1, \ldots, N$.

Prof. Dr. J. Franke, All of Statistics 3.8

[Figure: scree graph $(k, d_k)$ for the Corsican pitprop data]

Prof. Dr. J. Franke, All of Statistics 3.9

Eigenvector w.r.t. $d_1$:
$$e_1 = (0.40,\ 0.41,\ 0.12,\ 0.17,\ 0.06,\ 0.28,\ 0.40,\ 0.29,\ 0.36,\ 0.38,\ 0.01,\ 0.12,\ 0.11)^T$$

Interpretation of the first empirical principal components:

$V_{j1}$ ≈ average of $X_{jk}$, $k = 1, 2, 6$–$10$: total size of pit prop
$V_{j2}$ ≈ average of $X_{jk}$, $k = 3, 4$: degree of seasoning
$V_{j3}$ ≈ average of $X_{jk}$, $k = 4$–$7$: speed of growth
$V_{j4}$ ≈ $X_{j11}$: length of clear prop from top
$V_{j5}$ ≈ $X_{j12}$: average no. of knots per whorl
$V_{j6}$ ≈ average of $X_{jk}$, $k = 5, 13$

Rules of thumb: a) $q = 6$, b) $q = 4$; scree graph: $q = 3$ or $q = 6$.

Prof. Dr. J. Franke, All of Statistics 3.10

Discriminant Analysis

Classification problem: object from one of the classes $C_1, \ldots, C_m$. Observed: feature vector $X_0$.

Assumption: $X_0$ has density $f_k(x)$ if the object is from $C_k$.

Bayes classifier: decide for class $C_k$ if
$$X_0 \in \{x;\ f_k(x) = \max_{i=1,\ldots,m} f_i(x)\} \qquad (1)$$

Gaussian case: if the object is from $C_k$, $X_0$ is $N_p(\mu_k, \Sigma)$-distributed. Then (1) is equivalent to
$$\|X_0 - \mu_k\|_\Sigma^2 = \min_{i=1,\ldots,m} \|X_0 - \mu_i\|_\Sigma^2,$$
where $\|X_0 - \mu_k\|_\Sigma^2 = (X_0 - \mu_k)^T \Sigma^{-1} (X_0 - \mu_k)$ is the Mahalanobis distance w.r.t. $\Sigma$.
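For known parameters this is simply a nearest-mean rule in the Mahalanobis metric; a minimal sketch (names are illustrative, not part of the notes):

```python
import numpy as np

def bayes_classify_gaussian(x0, mus, Sigma):
    """Gaussian Bayes rule with common covariance Sigma and known class means mus:
    decide for the class whose mean minimizes the Mahalanobis distance to x0."""
    Sigma_inv = np.linalg.inv(Sigma)
    dists = [(x0 - mu) @ Sigma_inv @ (x0 - mu) for mu in mus]
    return int(np.argmin(dists))       # index k of the chosen class C_k
```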

Prof. Dr. J. Franke, All of Statistics 3.11

Special case $m = 2$: decide for $C_1$ if
$$\alpha^T (X_0 - \mu) > 0, \qquad \text{with } \mu = \tfrac{1}{2}(\mu_1 + \mu_2), \quad \alpha = \Sigma^{-1}(\mu_1 - \mu_2).$$

In practice, $\mu_1, \ldots, \mu_m, \Sigma$ are unknown. Training set $X_j^{(k)}$, $j = 1, \ldots, n_k$, i.i.d. $N_p(\mu_k, \Sigma)$, $k = 1, \ldots, m$, with known classification. Sample means and sample covariance matrices for each subsample:
$$\hat\mu_k = \frac{1}{n_k} \sum_{j=1}^{n_k} X_j^{(k)}, \qquad S^{(k)}, \quad k = 1, \ldots, m.$$

Combine $S^{(1)}, \ldots, S^{(m)}$ into an estimate of $\Sigma$:
$$S = \frac{1}{N} \sum_{k=1}^m n_k S^{(k)}, \qquad N = n_1 + \ldots + n_m.$$

Prof. Dr. J. Franke, All of Statistics 3.12

Empirical classification rule: decide for $C_k$ if
$$\|X_0 - \hat\mu_k\|_S^2 = \min_{i=1,\ldots,m} \|X_0 - \hat\mu_i\|_S^2.$$

Warning: if $\mu_1 = \ldots = \mu_m$, classification is meaningless, but $\hat\mu_1, \ldots, \hat\mu_m$ are not equal. Safeguard: test $H_0: \mu_1 = \ldots = \mu_m$ (multivariate ANOVA).

Fisher's discriminant rule

No Gaussian assumption; consider only linear discriminant functions, i.e. decide for $C_k$ if
$$|a^T (X_0 - \hat\mu_k)| < |a^T (X_0 - \hat\mu_i)| \quad \text{for all } i \neq k.$$
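A minimal sketch of the plug-in rule with the pooled covariance estimate (the interface and names are illustrative):

```python
import numpy as np

def fit_plugin_rule(X, y):
    """Estimate class means and the pooled covariance S = (1/N) sum_k n_k S^(k).
    X: (N, p) training features, y: (N,) integer labels 0, ..., m-1."""
    classes = np.unique(y)
    N, p = X.shape
    mus, S = [], np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        mus.append(mu_k)
        S += (Xk - mu_k).T @ (Xk - mu_k)   # equals n_k * S^(k)
    return np.array(mus), S / N

def classify(x0, mus, S):
    """Empirical rule: decide for C_k minimizing ||x0 - mu_hat_k||_S^2."""
    S_inv = np.linalg.inv(S)
    diffs = mus - x0                        # (m, p)
    dists = np.einsum('ij,jk,ik->i', diffs, S_inv, diffs)
    return int(np.argmin(dists))
```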

Prof. Dr. J. Franke, All of Statistics 3.13

$$|a^T (X_0 - \hat\mu_k)| < |a^T (X_0 - \hat\mu_i)| \quad \text{for all } i \neq k.$$

Choose $a$ such that the discriminant function shows maximal differences between the groups (of a training set): $a$ is an eigenvector w.r.t. the largest eigenvalue of $S^{-1} B$, where
$$B = \frac{1}{N} \sum_{k=1}^m n_k (\hat\mu_k - \bar X_N)(\hat\mu_k - \bar X_N)^T, \qquad \bar X_N = \frac{1}{N} \sum_{k=1}^m n_k \hat\mu_k = \frac{1}{N} \sum_{k=1}^m \sum_{j=1}^{n_k} X_j^{(k)}.$$

For $m = 2$, Fisher's rule coincides with the Bayes classification for Gaussian data. For $m > 2$, usually not.
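A sketch of how the direction $a$ could be computed from a labelled training set (names are illustrative):

```python
import numpy as np

def fisher_direction(X, y):
    """Direction a = eigenvector of S^{-1} B for the largest eigenvalue.
    X: (N, p) training features, y: (N,) integer labels 0, ..., m-1."""
    N, p = X.shape
    x_bar = X.mean(axis=0)                         # grand mean \bar{X}_N
    S, B = np.zeros((p, p)), np.zeros((p, p))
    for k in np.unique(y):
        Xk = X[y == k]
        n_k, mu_k = len(Xk), Xk.mean(axis=0)
        S += (Xk - mu_k).T @ (Xk - mu_k)           # within-group part, n_k * S^(k)
        diff = (mu_k - x_bar)[:, None]
        B += n_k * (diff @ diff.T)                 # between-group part
    S, B = S / N, B / N
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S, B))   # S^{-1} B
    return np.real(eigvecs[:, np.argmax(np.real(eigvals))])
```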

Prof. Dr. J. Franke, All of Statistics 3.18

Cluster analysis

Discriminant analysis: classes $C_1, \ldots, C_m$ are given. Find a classification rule and estimate it based on a training set (supervised learning).

Cluster analysis: find appropriate classes! No information about which object belongs to which group (unsupervised learning).

Observed feature vectors $X_1, \ldots, X_N \in \mathbb{R}^p$, independent. Assume, e.g., that all $X_j$ are $N_p(\mu_k, \Sigma_k)$-distributed if $X_j$ belongs to class $C_k$, for some unknown $\mu_k, \Sigma_k$, $k = 1, \ldots, m$, and unknown $m, C_1, \ldots, C_m$.

Prof. Dr. J. Franke, All of Statistics 3.19

Maximum likelihood over $\mu_k, \Sigma_k, C_1, \ldots, C_m$ and $m$ is in principle possible, with a penalty for large $m$ to avoid overfitting, but usually computationally not feasible.

Hierarchical clustering algorithms

Needed: distance between objects $i, j$. Common choice: Pearson distance of the feature vectors,
$$d_{ij}^2 = \sum_{k=1}^p \frac{(X_{ik} - X_{jk})^2}{s_k^2}, \qquad s_k^2 = \frac{1}{N-1} \sum_{j=1}^N (X_{jk} - \hat\mu_k)^2, \quad \hat\mu_k = \frac{1}{N} \sum_{j=1}^N X_{jk}.$$

Agglomerative: start with the maximal number of clusters, i.e. each object is its own cluster: $C_j^{(0)} = \{j\}$, $j = 1, \ldots, N$.
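The Pearson distance is just the Euclidean distance after standardizing each feature; a minimal sketch using scipy (assuming an $(N, p)$ array `X`):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def pearson_distances(X):
    """Matrix of squared Pearson distances d_ij^2 = sum_k (X_ik - X_jk)^2 / s_k^2."""
    s2 = X.var(axis=0, ddof=1)                # s_k^2 with 1/(N-1)
    d = pdist(X, metric='seuclidean', V=s2)   # standardized Euclidean distance
    return squareform(d) ** 2
```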

Prof. Dr. J. Franke, All of Statistics 3.20

Nearest neighbour single linkage (nnsl)

Sort the $d_{ij}$, $i < j$: $d_{r_1 s_1} \leq d_{r_2 s_2} \leq \ldots$

1) $N - 1$ clusters: $C_{r_1}^{(0)} + C_{s_1}^{(0)}$, and $C_j^{(0)}$, $j \neq r_1, s_1$.

2) If $r_1, s_1 \neq r_2, s_2$: $N - 2$ clusters $\{r_1, s_1\}$, $\{r_2, s_2\}$, $\{j\}$, $j \neq r_1, s_1, r_2, s_2$.
   If $r_1 = r_2$, $s_1 \neq s_2$: $N - 2$ clusters $\{r_1, s_1, s_2\}$, $\{j\}$, $j \neq r_1, s_1, s_2$.
   If $r_1 \neq r_2$, $s_1 = s_2$: ...

$l$) Join the cluster containing $r_l$ with the cluster containing $s_l$.

Stop if $d_{r_l s_l} >$ threshold $d_0$.

Prof. Dr. J. Franke, All of Statistics 3.21

Average linkage

$D_{rs}$ = distance between cluster $r$ and cluster $s$.

0) $D_{ij}^{(0)} = d_{ij}$ for $C_j^{(0)} = \{j\}$.

1) As in nnsl; distance of the new cluster $C_{r_1}^{(0)} + C_{s_1}^{(0)}$ to cluster $C_j^{(0)}$, $j \neq r_1, s_1$:
$$D_{1j}^{(1)} = \tfrac{1}{2} \big( d_{r_1 j} + d_{s_1 j} \big).$$

2) Join the two clusters with minimal distance $D_{rs}^{(i-1)}$. Define the distance of the new (merged) cluster to the other clusters $k$ by
$$D_{k}^{(i)} = \tfrac{1}{2} \big( D_{r k}^{(i-1)} + D_{s k}^{(i-1)} \big), \quad k \neq r, s.$$

3) Stop if all cluster distances $> d_0$.
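Both linkage methods are available off the shelf; a minimal scipy sketch, where the data `X` and the threshold `d0` are illustrative. Note that the $\tfrac12(D_{rk}+D_{sk})$ update above corresponds to scipy's 'weighted' method, while scipy's 'average' method averages over all point pairs instead.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Illustrative stand-in for a data matrix (e.g. 25 countries x 9 food groups)
X = np.random.default_rng(1).normal(size=(25, 9))

# Pearson distances as defined above
d = pdist(X, metric='seuclidean', V=X.var(axis=0, ddof=1))

Z_nnsl = linkage(d, method='single')     # nearest neighbour single linkage
Z_avg  = linkage(d, method='weighted')   # the 1/2 (D_rk + D_sk) update rule of the slide

d0 = 2.5                                 # illustrative stopping threshold
labels = fcluster(Z_avg, t=d0, criterion='distance')   # cut where distances exceed d0
```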

Prof. Dr. J. Franke, All of Statistics 3.22

Data: protein consumption in $N = 25$ European countries for $p = 9$ food groups:

1. Red meat
2. White meat
3. Eggs
4. Milk
5. Fish
6. Cereals
7. Starchy foods
8. Pulses, nuts, and oil-seeds
9. Fruits and vegetables

Prof. Dr. J. Franke, All of Statistics 3.23

Complete linkage cluster analysis:

Eastern Europe (blue): East Germany, Czechoslovakia, Poland, USSR, Hungary;
Scandinavia (green): Sweden, Denmark, Norway, Finland;
Western Europe (red): UK, France, West Germany, Belgium, Ireland, Netherlands, Austria, Switzerland;
Iberian (purple): Spain, Portugal;
Mediterranean (orange): Italy, Greece;
the Balkans (yellow): Yugoslavia, Romania, Bulgaria, Albania.

PCA: dimension reduction to $q = 4$ principal components:
1: total meat consumption
2-4: consumption of red meat, white meat, and fish, respectively.

Weber, A. (1973) Agrarpolitik im Spannungsfeld der internationalen Ernährungspolitik, Institut für Agrarpolitik und Marktlehre, Kiel.
Data for download: http://lib.stat.cmu.edu/dasl/stories/proteinconsumptionineurope.html
