
Multivariate Statistics
Chapter 3: Principal Component Analysis

Pedro Galeano
Departamento de Estadística, Universidad Carlos III de Madrid
pedro.galeano@uc3m.es

Course 2017/2018
Master in Mathematical Engineering

Outline

1. Introduction
2. Principal component analysis
3. Normalized principal component analysis
4. Principal component analysis in practice

Introduction

Suppose that we have a data matrix $X$ of dimension $n \times p$. A central problem in multivariate data analysis is the curse of dimensionality: if the ratio $n/p$ is not large enough, some problems may be intractable.

For example, assume that we have a sample of size $n$ from a $p$-dimensional random variable following a $N(\mu_x, \Sigma_x)$ distribution. In this case, the number of parameters to estimate is $p + p(p+1)/2$: for $p = 5$ and $p = 10$, there are 20 and 65 parameters, respectively. Thus, the larger $p$ is, the more observations we need to obtain reliable estimates of the parameters.
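As a quick arithmetic check of this count, a one-line helper reproduces the figures above (the helper is illustrative, not part of the original slides):

```python
# Free parameters of a p-variate Gaussian: p entries of the mean vector
# plus p(p+1)/2 distinct entries of the symmetric covariance matrix.
def gaussian_param_count(p: int) -> int:
    return p + p * (p + 1) // 2

print(gaussian_param_count(5), gaussian_param_count(10))  # 20 65
```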

Introduction

Several dimension reduction techniques try to answer the same question: is it possible to describe with accuracy the values of the $p$ variables with a smaller number $r < p$ of new variables? In this chapter we study principal component analysis; the next chapter is devoted to factor analysis.

Introduction

As mentioned before, the main objective of principal component analysis (PCA) is to reduce the dimension of the problem. The simplest way to reduce the dimension is to keep just some of the variables of the observed multivariate random variable $x = (x_1, \ldots, x_p)'$ and discard all the others. However, this is not a very reasonable approach, since we lose all the information contained in the discarded variables.

Principal component analysis is a more flexible approach based on a few linear combinations of the original (centered) variables in $x = (x_1, \ldots, x_p)'$. All $p$ components of $x$ are required to reproduce the total system variability. However, much of this variability can often be accounted for by a small number $r < p$ of principal components. If so, there is almost as much information in the $r$ principal components as in the original $p$ variables.

Introduction

The general objective of PCA is dimension reduction. However, PCA is also a powerful method for interpreting the relationships between the univariate variables that form $x = (x_1, \ldots, x_p)'$. Indeed, a PCA often reveals relationships that were not previously suspected and thereby allows interpretations that would not ordinarily result.

It is important to note that a PCA is more a means to an end than an end in itself, because it frequently serves as an intermediate step in a much larger investigation. As we shall see, principal components depend solely on the covariance (or correlation) matrix of $x$; therefore, their development does not require a multivariate Gaussian assumption.

Principal component analysis

Given a multivariate random variable $x = (x_1, \ldots, x_p)'$ with mean $\mu_x$ and covariance matrix $\Sigma_x$, the principal components are contained in a new multivariate random variable of dimension $r \le p$ given by:

$$z = A' (x - \mu_x)$$

where $A$ is a certain $p \times r$ matrix whose columns are the $p \times 1$ vectors $a_j = (a_{j1}, \ldots, a_{jp})'$, for $j = 1, \ldots, r$. Therefore, $z = (z_1, \ldots, z_r)'$ is a linear transformation of $x$ given by the $r$ linear combinations $a_j$, for $j = 1, \ldots, r$.

Principal component analysis

Consequently:

$$E(z_j) = 0, \quad Var(z_j) = a_j' \Sigma_x a_j, \quad Cov(z_j, z_k) = a_j' \Sigma_x a_k$$

for $j, k = 1, \ldots, r$.

Principal component analysis

In particular, among all possible linear combinations of the variables in $x = (x_1, \ldots, x_p)'$, the principal components are those that simultaneously satisfy the following two properties:

1. The variances $Var(z_j) = a_j' \Sigma_x a_j$ are as large as possible.
2. The covariances $Cov(z_j, z_k) = a_j' \Sigma_x a_k$, for $j \ne k$, are 0, implying that the principal components of $x$, i.e., the variables in $z = (z_1, \ldots, z_r)'$, are uncorrelated.

Next, we formally derive the linear combinations $a_j$, for $j = 1, \ldots, r$, that lead to the PCs.

Principal component analysis

Assume that the covariance matrix $\Sigma_x$ has a set of $p$ eigenvalues $\lambda_1, \ldots, \lambda_p$ with associated eigenvectors $v_1, \ldots, v_p$.

The first principal component corresponds to the linear combination with maximum variance, i.e., the linear combination that maximizes $Var(z_1) = \sigma^2_{z_1} = a_1' \Sigma_x a_1$. However, $a_1' \Sigma_x a_1$ can be increased simply by multiplying $a_1$ by some constant. To eliminate this indeterminacy, it is convenient to restrict attention to coefficient vectors of unit length, i.e., we assume that $a_1' a_1 = 1$. The following linear combinations are obtained with a similar argument, adding the property that they are uncorrelated with the previous ones.

Principal component analysis

Therefore, we define:

First principal component: $a_1 = \arg\max \; a' \Sigma_x a$ subject to $a' a = 1$.

Second principal component: $a_2 = \arg\max \; a' \Sigma_x a$ subject to $a' a = 1$ and $a_1' \Sigma_x a = 0$.

$\vdots$

$r$-th principal component: $a_r = \arg\max \; a' \Sigma_x a$ subject to $a' a = 1$ and $a_k' \Sigma_x a = 0$, for $k < r$.
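A small numerical sanity check of this variational definition; a sketch assuming numpy, with an illustrative synthetic covariance matrix. It verifies that no random unit vector attains a larger value of $a' \Sigma_x a$ than the eigenvector of the largest eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary symmetric positive semi-definite matrix playing the role of Sigma_x.
B = rng.standard_normal((5, 5))
sigma_x = B @ B.T

eigvals, eigvecs = np.linalg.eigh(sigma_x)  # eigenvalues in ascending order
v1 = eigvecs[:, -1]                         # eigenvector of the largest eigenvalue

# Compare a' Sigma_x a at v1 against many random unit-length vectors a.
best_random = max(
    (a := u / np.linalg.norm(u)) @ sigma_x @ a
    for u in rng.standard_normal((10_000, 5))
)
print(v1 @ sigma_x @ v1 >= best_random)  # True: v1 maximizes the variance
```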

Principal component analysis

The first problem can be solved with the Lagrange multiplier method as follows. Let:

$$M = a_1' \Sigma_x a_1 - \beta_1 (a_1' a_1 - 1)$$

Then,

$$\frac{\partial M}{\partial a_1} = 2 \Sigma_x a_1 - 2 \beta_1 a_1 = 0 \implies \Sigma_x a_1 = \beta_1 a_1$$

Therefore, $a_1$ is an eigenvector of $\Sigma_x$ and $\beta_1$ is its corresponding eigenvalue. Which ones? Multiplying the last expression by $a_1'$, we get:

$$\sigma^2_{z_1} = a_1' \Sigma_x a_1 = \beta_1 a_1' a_1 = \beta_1$$

As $\sigma^2_{z_1}$ should be maximal, $\beta_1$ corresponds to the largest eigenvalue of $\Sigma_x$ ($\beta_1 = \lambda_1$) and $a_1$ is its associated eigenvector ($a_1 = v_1$).

Principal component analysis

The second principal component is given by:

$$a_2 = \arg\max \; a' \Sigma_x a \quad \text{subject to } a' a = 1 \text{ and } a_1' \Sigma_x a = 0$$

where $a_1 = v_1$, the eigenvector associated with the largest eigenvalue of the covariance matrix $\Sigma_x$. Therefore, $\Sigma_x v_1 = \lambda_1 v_1$, so that:

$$a_1' \Sigma_x a_2 = \lambda_1 v_1' a_2 = 0$$

Following the reasoning used for the first principal component, we define:

$$M = a_2' \Sigma_x a_2 - \beta_2 (a_2' a_2 - 1)$$

Then,

$$\frac{\partial M}{\partial a_2} = 2 \Sigma_x a_2 - 2 \beta_2 a_2 = 0 \implies \Sigma_x a_2 = \beta_2 a_2$$

Principal component analysis

Therefore, $a_2$ is an eigenvector of $\Sigma_x$ and $\beta_2$ is its corresponding eigenvalue. Which ones? Multiplying the last expression by $a_2'$:

$$\sigma^2_{z_2} = a_2' \Sigma_x a_2 = \beta_2 a_2' a_2 = \beta_2$$

As $\sigma^2_{z_2}$ should be maximal, $\beta_2$ corresponds to the second largest eigenvalue of $\Sigma_x$ ($\beta_2 = \lambda_2$) and $a_2$ is its associated eigenvector ($a_2 = v_2$).

This argument extends to the successive principal components. Therefore, the $r$ principal components correspond to the eigenvectors of the covariance matrix $\Sigma_x$ associated with the $r$ largest eigenvalues.

Principal component analysis

In summary, the principal components are given by:

$$z = V_r' (x - \mu_x)$$

where $V_r$ is the $p \times r$ matrix with orthonormal columns (i.e., $V_r' V_r = I_r$) whose columns are the first $r$ eigenvectors of $\Sigma_x$. The covariance matrix of $z$, $\Sigma_z$, is the diagonal matrix with elements $\lambda_1, \ldots, \lambda_r$, i.e., the first $r$ eigenvalues of $\Sigma_x$.

Therefore, the usefulness of the principal components is two-fold:

- They allow for an optimal representation, in a space of reduced dimension, of the original observations.
- They allow the original correlated variables to be transformed into new uncorrelated variables, facilitating the interpretation of the data.
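A minimal numpy sketch of this summary, on synthetic data (the distribution and all names are illustrative): it computes $V_r$ from the eigendecomposition of the covariance matrix, forms the scores $z = V_r'(x - \mu_x)$, and checks that their sample covariance is approximately $\mathrm{diag}(\lambda_1, \ldots, \lambda_r)$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0.0, 0.0, 0.0],
                            cov=[[4.0, 2.0, 0.0],
                                 [2.0, 3.0, 1.0],
                                 [0.0, 1.0, 2.0]],
                            size=2_000)

mu_x = X.mean(axis=0)
sigma_x = np.cov(X, rowvar=False)

# Eigendecomposition, sorted so that lambda_1 >= ... >= lambda_p.
eigvals, eigvecs = np.linalg.eigh(sigma_x)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

r = 2
V_r = eigvecs[:, :r]      # p x r matrix of the first r eigenvectors
Z = (X - mu_x) @ V_r      # scores: each row is z = V_r'(x - mu_x)

# Cov(z) should equal, up to sampling error, diag(lambda_1, ..., lambda_r).
print(np.round(np.cov(Z, rowvar=False), 2))
print(np.round(eigvals[:r], 2))
```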

Principal component analysis

Note that we can compute the principal components even for $r = p$. Taking this into account, it is easy to check that the PCs verify the following properties, for $j, k = 1, \ldots, p$ with $j \ne k$:

$$E(z_j) = 0, \quad Var(z_j) = \lambda_j, \quad Cov(z_j, z_k) = 0$$

$$Var(z_1) \ge Var(z_2) \ge \cdots \ge Var(z_p) \ge 0$$

$$\sum_{j=1}^{p} Var(z_j) = Tr(\Sigma_x) = \sum_{j=1}^{p} Var(x_j)$$

$$\prod_{j=1}^{p} Var(z_j) = |\Sigma_x|$$

Principal component analysis

In particular, note that the property $\sum_{j=1}^{p} Var(z_j) = Tr(\Sigma_x)$ ensures that the set of $p$ principal components conserves the initial variability. Therefore, a measure of how well the $r$-th PC explains variation is the proportion of variability explained by the $r$-th PC:

$$PV_r = \frac{\lambda_r}{\lambda_1 + \cdots + \lambda_p}, \quad r = 1, \ldots, p$$

Additionally, a measure of how well the first $r$ PCs explain variation is the accumulated proportion of variability explained by the first $r$ PCs:

$$APV_r = \frac{\lambda_1 + \cdots + \lambda_r}{\lambda_1 + \cdots + \lambda_p}, \quad r = 1, \ldots, p$$

Therefore, if most (for instance, 80% or 90%) of the total variability can be attributed to the first one, two or three principal components, then these components can replace the original $p$ variables without much loss of information.
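These two quantities are one line each in numpy. A sketch that uses, for concreteness, the eigenvalues of the correlation matrix from the illustrative example later in the chapter:

```python
import numpy as np

# Eigenvalues in decreasing order (from the illustrative example below).
lam = np.array([3.59, 1.63, 1.11, 0.70, 0.38, 0.30, 0.14, 0.11])

pv = lam / lam.sum()    # PV_r: proportion of variability of each PC
apv = np.cumsum(pv)     # APV_r: accumulated proportion

print(np.round(pv, 3))   # roughly 0.45, 0.20, 0.14, ...
print(np.round(apv, 3))  # roughly 0.45, 0.65, 0.79, ...
```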

Principal component analysis

The covariance between the principal components and the original variables can be written as follows:

$$Cov(z, x) = E[z (x - \mu_x)'] = E[V_r' (x - \mu_x)(x - \mu_x)'] = V_r' E[(x - \mu_x)(x - \mu_x)'] = V_r' \Sigma_x$$

Now, the spectral decomposition of $\Sigma_x$ is given by $\Sigma_x = V_p \Lambda_p V_p'$, where $\Lambda_p$ is the diagonal matrix that contains the $p$ eigenvalues of $\Sigma_x$ in decreasing order and $V_p$ is the matrix that contains the $p$ associated eigenvectors of $\Sigma_x$. Therefore:

$$Cov(z, x) = V_r' V_p \Lambda_p V_p' = \Lambda_r V_r'$$

where $\Lambda_r$ is the diagonal matrix that contains the $r$ largest eigenvalues of $\Sigma_x$ in decreasing order. In particular, note that $\Lambda_r = \Sigma_z$.

Principal component analysis

On the other hand, the correlation between the principal components and the original variables can be written as follows:

$$Cor(z, x) = \Sigma_z^{-1/2} E[z (x - \mu_x)'] D_x^{-1/2} = \Sigma_z^{-1/2} V_r' \Sigma_x D_x^{-1/2}$$

where $D_x$ is the $p \times p$ diagonal matrix whose elements are the variances in the main diagonal of $\Sigma_x$. Now, replacing $\Sigma_x$ with $V_p \Lambda_p V_p'$ and $\Sigma_z$ with $\Lambda_r$:

$$Cor(z, x) = \Lambda_r^{-1/2} V_r' V_p \Lambda_p V_p' D_x^{-1/2} = \Lambda_r^{1/2} V_r' D_x^{-1/2}$$

because $\Lambda_r^{-1/2} V_r' V_p \Lambda_p V_p' = \Lambda_r^{-1/2} \Lambda_r V_r' = \Lambda_r^{1/2} V_r'$.

The correlations between the principal components and the variables often help to interpret the components, as they measure the contribution of each individual variable to each principal component.
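The formula $Cor(z, x) = \Lambda_r^{1/2} V_r' D_x^{-1/2}$ translates directly into code. A self-contained sketch with an illustrative covariance matrix:

```python
import numpy as np

sigma_x = np.array([[4.0, 2.0, 0.0],
                    [2.0, 3.0, 1.0],
                    [0.0, 1.0, 2.0]])

eigvals, eigvecs = np.linalg.eigh(sigma_x)
order = np.argsort(eigvals)[::-1]
lam, V = eigvals[order], eigvecs[:, order]

r = 2
# Cor(z, x) = Lambda_r^{1/2} V_r' D_x^{-1/2}: row j holds the correlations
# of the j-th principal component with the p original variables.
cor_zx = np.sqrt(lam[:r])[:, None] * V[:, :r].T / np.sqrt(np.diag(sigma_x))
print(np.round(cor_zx, 2))
```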

Normalized principal component analysis

One problem with principal components is that they are not scale-invariant: if we change the units of some of the variables, the covariance matrix changes and so do the principal components. Additionally, if there are large differences between the variances of the original variables, those whose variances are largest will tend to dominate the first components.

In these circumstances, it is better to standardize the variables first. In other words, principal components should only be extracted from the original variables when all of them are measured on the same scale.

Normalized principal component analysis

Therefore, if the variables have different units of measurement, we define $y = D_x^{-1/2} (x - \mu_x)$, the vector of univariate standardized variables, and obtain the PCs from $y$. For that, note that $E(y) = 0$ and:

$$Cov(y) = E[y y'] = E[D_x^{-1/2} (x - \mu_x)(x - \mu_x)' D_x^{-1/2}] = D_x^{-1/2} \Sigma_x D_x^{-1/2} = \varrho_x$$

i.e., the covariance matrix of $y$ is the correlation matrix of $x$.

Normalized principal component analysis

Consequently, the principal components of $y$ are obtained from the eigenvectors of the correlation matrix of $x$, denoted by $v_1^\varrho, \ldots, v_r^\varrho$, with associated eigenvalues $\lambda_1^\varrho \ge \cdots \ge \lambda_r^\varrho \ge 0$. In particular, they are given by:

$$z^\varrho = (V_r^\varrho)' y = (V_r^\varrho)' D_x^{-1/2} (x - \mu_x)$$

where $V_r^\varrho = [v_1^\varrho \cdots v_r^\varrho]$ is the $p \times r$ matrix that contains the first $r$ eigenvectors of the correlation matrix $\varrho_x$. The PCs obtained in this way are called the normalized principal components.
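A sketch of normalized PCA under the same assumptions as the earlier snippets (numpy, illustrative synthetic data): the eigendecomposition is taken on the correlation matrix and applied to the standardized variables.

```python
import numpy as np

rng = np.random.default_rng(2)
# Three variables on very different scales.
X = rng.standard_normal((200, 3)) * np.array([1.0, 10.0, 100.0])

# Eigendecomposition of the correlation matrix of x.
rho_x = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(rho_x)
order = np.argsort(eigvals)[::-1]
lam, V = eigvals[order], eigvecs[:, order]

# Standardize, y = D_x^{-1/2}(x - mu_x), then project.
Y = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Z = Y @ V                      # normalized principal component scores

print(np.round(lam.sum(), 2))  # 3.0: the eigenvalues of rho_x sum to p
```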

Normalized principal component analysis

All the previous results apply, with some simplifications, as the variance of each standardized variable is 1. Therefore, the normalized PCs verify the following properties, for $j, k = 1, \ldots, p$ with $j \ne k$:

$$E(z_j^\varrho) = 0, \quad Var(z_j^\varrho) = \lambda_j^\varrho, \quad Cov(z_j^\varrho, z_k^\varrho) = 0$$

$$Var(z_1^\varrho) \ge Var(z_2^\varrho) \ge \cdots \ge Var(z_p^\varrho) \ge 0$$

$$\sum_{j=1}^{p} Var(z_j^\varrho) = Tr(\Sigma_y) = Tr(\varrho_x) = p$$

$$\prod_{j=1}^{p} Var(z_j^\varrho) = |\varrho_x|$$

Normalized principal component analysis

For normalized PCs, the proportion of variability explained by the $r$-th principal component is given by:

$$PV_r^\varrho = \frac{\lambda_r^\varrho}{p}$$

Similarly, the accumulated proportion of variability explained by the first $r$ principal components is given by:

$$APV_r^\varrho = \frac{\lambda_1^\varrho + \cdots + \lambda_r^\varrho}{p}$$

Normalized principal component analysis

Additionally, it is possible to show that the covariance between the normalized principal components and the original variables can be written as follows:

$$Cov(z^\varrho, x) = (V_r^\varrho)' D_x^{-1/2} \Sigma_x$$

Moreover, the correlation between the normalized principal components and the standardized variables can be written as follows:

$$Cor(z^\varrho, y) = (\Lambda_r^\varrho)^{1/2} (V_r^\varrho)'$$

Here, $\Lambda_r^\varrho$ is the covariance matrix of $z^\varrho$, denoted by $\Sigma_z^\varrho$, i.e., the $r \times r$ diagonal matrix that contains the eigenvalues $\lambda_1^\varrho, \ldots, \lambda_r^\varrho$ of $\varrho_x$.

Principal component analysis in practice

In practice, one replaces the population quantities with their corresponding sample counterparts based on the data matrix $X$ of dimension $n \times p$. The principal component scores are the values of the new variables.

If we use the sample covariance matrix to obtain the principal components, the data matrix that contains the principal component scores is given by:

$$Z = \widetilde{X} V_r^{S_x}$$

where $\widetilde{X}$ is the centered data matrix and $V_r^{S_x}$ is the matrix that contains the eigenvectors of the sample covariance matrix $S_x$ linked with the $r$ largest eigenvalues.

Principal component analysis in practice

On the other hand, if we use the sample correlation matrix to obtain the principal components, the data matrix that contains the principal component scores is given by:

$$Z = Y V_r^{R_x} = \widetilde{X} D_{S_x}^{-1/2} V_r^{R_x}$$

where $Y$ is the standardized data matrix, $V_r^{R_x}$ is the matrix that contains the eigenvectors of the sample correlation matrix $R_x$ linked with the $r$ largest eigenvalues, and $D_{S_x}$ is the diagonal matrix that contains the sample variances of the columns of $X$.
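Both sample versions fit in one small function; a sketch, with all names illustrative and numpy assumed:

```python
import numpy as np

def pc_scores(X: np.ndarray, r: int, use_correlation: bool = True) -> np.ndarray:
    """Sample PC scores: Z = X_tilde V_r (covariance) or Z = Y V_r (correlation)."""
    Xt = X - X.mean(axis=0)              # centered data matrix X_tilde
    if use_correlation:
        Xt = Xt / X.std(axis=0, ddof=1)  # standardized data matrix Y
    S = np.cov(Xt, rowvar=False)         # S_x, or R_x if standardized
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]    # decreasing eigenvalue order
    return Xt @ eigvecs[:, order[:r]]

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 8))
print(pc_scores(X, r=3).shape)  # (50, 3)
```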

Principal component analysis in practice

Different rules have been suggested for selecting the number of components $r$; two of them are sketched in code below:

1. Plot a graph of $1, \ldots, p$ against $\lambda_1, \ldots, \lambda_p$ (the scree plot): the idea is to exclude from the analysis those components associated with small eigenvalues of approximately the same size.
2. Select components until a certain proportion of the variance has been covered, such as 80% or 90%: this rule should be applied with caution, because sometimes a single component picks up most of the variability, whereas there might be other components with interesting interpretations.
3. Discard those components associated with eigenvalues smaller than a certain value, such as the mean of the eigenvalues of the sample covariance or correlation matrix: again, this rule is arbitrary; a variable that is independent of the rest usually accounts for a principal component of its own and can have a large eigenvalue.
4. Use asymptotic results on the estimated eigenvalues of the covariance or correlation matrix used to derive the PCs.
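Rules 2 and 3 are straightforward to implement; a sketch (the function name and the 80% threshold are illustrative):

```python
import numpy as np

def n_components(eigvals: np.ndarray, target: float = 0.80) -> tuple[int, int]:
    """Rule 2: smallest r whose accumulated proportion reaches `target`.
    Rule 3: number of eigenvalues above the mean eigenvalue."""
    lam = np.sort(eigvals)[::-1]
    apv = np.cumsum(lam) / lam.sum()
    r_variance = int(np.searchsorted(apv, target)) + 1
    r_mean = int((lam > lam.mean()).sum())
    return r_variance, r_mean

# Eigenvalues of the correlation matrix from the illustrative example below.
lam = np.array([3.59, 1.63, 1.11, 0.70, 0.38, 0.30, 0.14, 0.11])
print(n_components(lam))  # (4, 3); the slides settle on r = 3, with APV_3 = 0.79
```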

Illustrative example I

Consider the following eight univariate variables measured on the 50 states of the USA:

x1: population estimate as of July 1, 1975 (in thousands).
x2: per capita income, 1974 (in dollars).
x3: illiteracy, 1970 (percent of population).
x4: life expectancy in years (1969-71).
x5: murder and non-negligent manslaughter rate per 100,000 population (1976).
x6: percent high-school graduates (1970).
x7: mean number of days with minimum temperature below freezing (1931-1960) in the capital or a large city.
x8: land area in square miles.

Illustrative example I

[Figure: scatterplot matrix of the original dataset, with panels for Population, Income, Illiteracy, Life.Exp, Murder, HS.Grad, Frost and Area.]

Illustrative example I

The sample mean vector for the data is given by:

x̄ = (4246.42, 4435.80, 1.17, 70.87, 7.37, 53.10, 104.46, 70735.88)'

The sample covariance matrix is given by:

S_x =
[  19.93·10⁶   57.12·10³    292.86   −407.84    5663.52   −3551.50   −77.08·10³   8.58·10⁶ ]
[  57.12·10³   37·10⁴      −163.70    280.66    −521.89    3076.76     7227.60    1.90·10⁷ ]
[  292.86     −163.70         0.37     −0.48       1.58      −3.23      −21.29    4.01·10³ ]
[ −407.84      280.66        −0.48      1.80      −3.86       6.31       18.28   −1.22·10⁴ ]
[  5663.52    −521.89         1.58     −3.86      13.62     −14.54     −103.40    7.19·10⁴ ]
[ −3551.50     3076.76       −3.23      6.31     −14.54      65.23      153.99    2.29·10⁵ ]
[ −77.08·10³   7227.60      −21.29     18.28    −103.40     153.99     2702.00    2.62·10⁵ ]
[  8.58·10⁶    1.90·10⁷      4.01·10³ −1.22·10⁴   7.19·10⁴   2.29·10⁵   2.62·10⁵  7.28·10⁹ ]

Illustrative example I

The eigenvectors of $S_x$ are the columns of the matrix $V_8^{S_x}$ given by (rows correspond to the variables $x_1, \ldots, x_8$):

V_8^{S_x} =
[ 0.00  0.99  0.02  0.00  0.00  0.00  0.00  0.00 ]
[ 0.00  0.02  0.99  0.02  0.00  0.00  0.00  0.00 ]
[ 0.00  0.00  0.00  0.00  0.04  0.03  0.02  0.99 ]
[ 0.00  0.00  0.00  0.00  0.11  0.28  0.95  0.01 ]
[ 0.00  0.00  0.00  0.02  0.23  0.92  0.30  0.04 ]
[ 0.00  0.00  0.00  0.02  0.96  0.26  0.04  0.03 ]
[ 0.00  0.00  0.00  0.98  0.03  0.01  0.00  0.00 ]
[ 0.99  0.00  0.00  0.00  0.00  0.00  0.00  0.00 ]

Note that the first eigenvector is associated with the last variable (Area), which is the one with the largest sample variance, and so on for the other ones.

Illustrative example I

The eigenvalues of $S_x$ are $\lambda_1^{S_x} = 7.28 \cdot 10^9$, $\lambda_2^{S_x} = 1.99 \cdot 10^7$, $\lambda_3^{S_x} = 3.12 \cdot 10^5$, $\lambda_4^{S_x} = 2.15 \cdot 10^3$, $\lambda_5^{S_x} = 36.51$, $\lambda_6^{S_x} = 6.05$, $\lambda_7^{S_x} = 0.43$ and $\lambda_8^{S_x} = 0.08$.

The proportions of variability explained by the components are $0.997$, $2.73 \cdot 10^{-3}$, $4.28 \cdot 10^{-5}$, $2.94 \cdot 10^{-7}$, $5.00 \cdot 10^{-9}$, $8.29 \cdot 10^{-10}$, $5.93 \cdot 10^{-11}$ and $1.15 \cdot 10^{-11}$, respectively. Consequently, the first accumulated proportion of variability is already $0.997$, while the others are larger than $0.99999$.
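This behavior is easy to reproduce: when one variable's variance dwarfs the others (as Area does here), the leading eigenvector of the covariance matrix is essentially that variable's indicator. A sketch with synthetic data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
# Independent variables whose standard deviations differ by orders of magnitude.
X = rng.standard_normal((500, 3)) * np.array([1.0, 100.0, 10_000.0])

S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
v1 = eigvecs[:, np.argmax(eigvals)]

# The first eigenvector is (up to sign) nearly the indicator of the
# largest-variance variable, as with Area in the state data.
print(np.round(np.abs(v1), 3))  # approximately [0, 0, 1]
```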

Illustrative example I

We therefore standardize the data matrix; that is, we obtain the eigenvalues and eigenvectors of the sample correlation matrix of the data set, which is given by:

R_x =
[  1     0.20   0.10  −0.06   0.34  −0.09  −0.33   0.02 ]
[  0.20  1     −0.43   0.34  −0.23   0.61   0.22   0.36 ]
[  0.10 −0.43   1     −0.58   0.70  −0.65  −0.67   0.07 ]
[ −0.06  0.34  −0.58   1     −0.78   0.58   0.26  −0.10 ]
[  0.34 −0.23   0.70  −0.78   1     −0.48  −0.53   0.22 ]
[ −0.09  0.61  −0.65   0.58  −0.48   1      0.36   0.33 ]
[ −0.33  0.22  −0.67   0.26  −0.53   0.36   1      0.05 ]
[  0.02  0.36   0.07  −0.10   0.22   0.33   0.05   1    ]

Illustrative example I

The eigenvectors of $R_x$ are the columns of the matrix $V_8^{R_x}$ given by (rows correspond to $x_1, \ldots, x_8$; entries shown without signs, and the signed first three columns appear in the expressions for $z_1$, $z_2$ and $z_3$ below):

V_8^{R_x} =
[ 0.12  0.41  0.65  0.40  0.40  0.01  0.06  0.21 ]
[ 0.29  0.51  0.10  0.08  0.63  0.46  0.00  0.06 ]
[ 0.46  0.05  0.07  0.35  0.00  0.38  0.61  0.33 ]
[ 0.41  0.08  0.35  0.44  0.32  0.21  0.25  0.52 ]
[ 0.44  0.30  0.10  0.16  0.12  0.32  0.29  0.67 ]
[ 0.42  0.29  0.04  0.23  0.09  0.64  0.39  0.30 ]
[ 0.35  0.15  0.38  0.61  0.21  0.21  0.47  0.02 ]
[ 0.03  0.58  0.51  0.20  0.49  0.14  0.28  0.01 ]

The eigenvalues of $R_x$ are $\lambda_1^{R_x} = 3.59$, $\lambda_2^{R_x} = 1.63$, $\lambda_3^{R_x} = 1.11$, $\lambda_4^{R_x} = 0.70$, $\lambda_5^{R_x} = 0.38$, $\lambda_6^{R_x} = 0.30$, $\lambda_7^{R_x} = 0.14$ and $\lambda_8^{R_x} = 0.11$. The proportions of variability explained by the components are 0.449, 0.203, 0.138, 0.088, 0.048, 0.038, 0.018 and 0.014. Consequently, the accumulated proportions of variability are 0.449, 0.653, 0.792, 0.881, 0.929, 0.967, 0.985 and 1, respectively.

Illustrative example I

[Figure: pairwise scatterplots of the principal component scores PC1 through PC8.]

Illustrative example I

[Figure: scree plot of the eight components (Comp.1 to Comp.8), with variances decreasing from about 3.59 to 0.11.]

Illustrative example I

In this case, we can select the first three principal components, as they explain 79% of the total variability and the mean of the eigenvalues of the sample correlation matrix is 1 ($\lambda_3^{R_x} = 1.11$ and $\lambda_4^{R_x} = 0.70$).

The first component is given by:

$$z_1 = 0.12 x_1 + 0.29 x_2 - 0.46 x_3 + 0.41 x_4 - 0.44 x_5 + 0.42 x_6 + 0.35 x_7 + 0.03 x_8$$

The first principal component distinguishes cold states with rich, long-lived and educated populations from warm states with poor, short-lived, ill-educated and violent populations.

Illustrative example I

The second component is given by:

$$z_2 = 0.41 x_1 + 0.51 x_2 + 0.05 x_3 - 0.08 x_4 + 0.30 x_5 + 0.29 x_6 - 0.15 x_7 + 0.58 x_8$$

The second principal component distinguishes big, populated states with rich and educated, although violent, populations from small states with few inhabitants and poor, ill-educated people.

The third component is given by:

$$z_3 = 0.65 x_1 + 0.10 x_2 - 0.07 x_3 + 0.35 x_4 - 0.10 x_5 - 0.04 x_6 - 0.38 x_7 - 0.51 x_8$$

The third principal component distinguishes populated states with rich and long-lived populations from warm and big states that tend to be ill-educated and violent.

Illustrative example I

[Figure: first versus second principal component scores for the 50 states, labeled by state name; Alaska, California, Texas and New York show the largest second-component scores.]

Illustrative example I

[Figure: first versus third principal component scores for the 50 states, labeled by state name.]

Illustrative example I

[Figure: second versus third principal component scores for the 50 states, labeled by state name.]

Illustrative example I

Finally, the correlations between the first three principal components (based on the standardized variables) and the original variables are given by:

Cor(z, x) =
[ 0.23  0.56  −0.88   0.78  −0.84   0.80   0.67   0.06 ]
[ 0.52  0.66   0.06  −0.10   0.39   0.38  −0.19   0.75 ]
[ 0.69  0.10  −0.07   0.37  −0.11  −0.05  −0.40  −0.53 ]

Chapter outline

We are now ready for Chapter 4: Factor Analysis.
