Principal Components

Contents:
  Summary
  Statistical Model
  Analysis Summary
  Analysis Options
  Scree Plot
  Component Weights
  2D and 3D Component Plots
  Data Table
  2D and 3D Component Plots
  2D and 3D Biplots
  Factorability Tests
  Save Results

Summary

The Principal Components procedure is designed to extract k principal components from a set of p quantitative variables X. The principal components are defined as the set of orthogonal linear combinations of X that have the greatest variance. Determining the principal components is often used to reduce the dimensionality of a set of predictor variables prior to their use in procedures such as multiple regression or cluster analysis. When the variables are highly correlated, the first few principal components may be sufficient to describe most of the variability present.

Sample StatFolio: pca.sgp

Sample Data

The file 93cars.sgd contains information on 26 variables for n = 93 makes and models of automobiles, taken from Lock (1993). The table below shows a partial list of the data in that file:

Make       Model       Engine Size  Horsepower  Fuel Tank  Passengers  Length
Acura      Integra     1.8          140         13.2       5           177
Acura      Legend      3.2          200         18         5           195
Audi       90          2.8          172         16.9       5           180
Audi       100         2.8          172         21.1       6           193
BMW        535i        3.5          208         21.1       4           186
Buick      Century     2.2          110         16.4       6           189
Buick      LeSabre     3.8          170         18         6           200
Buick      Roadmaster  5.7          180         23         6           216
Buick      Riviera     3.8          170         18.8       5           198
Cadillac   DeVille     4.9          200         18         6           206
Cadillac   Seville     4.6          295         20         5           204
Chevrolet  Cavalier    2.2          110         15.2       5           182

© 2013 by StatPoint Technologies, Inc.

It is desired to extract the principal components from the following variables:

Engine Size
Horsepower
Fueltank
Passengers
Length
Wheelbase
Width
U Turn Space
Rear seat
Luggage
Weight

A matrix plot of the data is shown below:

[Matrix plot: pairwise scatterplots of the 11 variables.]

As might be expected, the variables are highly correlated, since most are related to vehicle size.

Data Input

The data input dialog box requests the names of the columns containing the data:

Data: either the original observations or the sample covariance matrix Σ̂. If entering the original observations, enter p numeric columns containing the n values for each column of X. If entering the sample covariance matrix, enter p numeric columns containing the p values for each column of Σ̂. If the covariance matrix is entered, some of the tables and plots will not be available.

Point Labels: optional labels for each observation.

Select: subset selection.

Statistical Model

The goal of a principal components analysis is to construct k linear combinations of the p variables X that contain the greatest variance. The linear combinations take the form

Y_1 = a_11 X_1 + a_12 X_2 + ... + a_1p X_p
Y_2 = a_21 X_1 + a_22 X_2 + ... + a_2p X_p
...
Y_k = a_k1 X_1 + a_k2 X_2 + ... + a_kp X_p        (1)

The first principal component is the linear combination that has maximum variance, subject to the constraint that the coefficient vector has unit length, i.e.,

a_11² + a_12² + ... + a_1p² = 1        (2)

If the covariance matrix of X equals Σ, then the variance of Y_1 is

Var(Y_1) = a_1′ Σ a_1        (3)

The second principal component is the linear combination that has the next greatest variance, subject to the same unit-length constraint and to the additional constraint that it be uncorrelated with the first principal component. Subsequent components explain as much of the remaining variance as possible, while being uncorrelated with all of the other components. Under this model, the coefficient vectors a_j correspond to the eigenvectors of Σ, while the variances of the Y's are equal to the eigenvalues:

Var(Y_j) = λ_j        (4)

The total population variance equals the sum of the eigenvalues:

Total population variance = λ_1 + λ_2 + ... + λ_p        (5)

One criterion for selecting the number of principal components to extract is to select all components for which the corresponding eigenvalue is at least 1, implying that the component represents at least a fraction 1/p of the total population variance.

STATGRAPHICS gives the option of extracting principal components based on either the covariance matrix or the correlation matrix, depending on the Standardize setting on the Analysis Options dialog box. When the variables are in different units, it is usually best to base the analysis on the correlation matrix (which is the default).
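The model above can be illustrated outside STATGRAPHICS by an eigendecomposition of the correlation matrix. This is a minimal NumPy sketch on synthetic data (the three correlated variables are made up for illustration and are not the 93cars file):

```python
import numpy as np

# Three synthetic, highly correlated variables (illustrative only).
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = base + 0.2 * rng.normal(size=(200, 3))

# Eigendecomposition of the correlation matrix (the "Standardize" default).
R = np.corrcoef(X, rowvar=False)
eigvals, A = np.linalg.eigh(R)          # columns of A are the coefficient vectors a_j
order = np.argsort(eigvals)[::-1]       # largest variance first
eigvals, A = eigvals[order], A[:, order]

# Constraint (2): each coefficient vector has unit length.
unit_length = np.allclose((A ** 2).sum(axis=0), 1.0)

# Equation (5): the eigenvalues sum to the total variance (= p for a correlation matrix).
total_variance = eigvals.sum()
```

Because the three columns share a common source, the first eigenvalue absorbs most of the total variance, mirroring the behavior described for the example data.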

Analysis Summary

The Analysis Summary table is shown below:

Principal Components Analysis
Data variables: Engine Size, Horsepower, Fueltank, Passengers, Length, Wheelbase, Width, U Turn Space, Rear seat, Luggage, Weight
Data input: observations
Number of complete cases: 82
Missing value treatment: listwise
Standardized: yes
Number of components extracted: 2

Component   Eigenvalue   Percent of   Cumulative
Number                   Variance     Percentage
1           7.92395      72.036        72.036
2           1.32354      12.032        84.068
3           0.47071       4.279        88.347
4           0.353248      3.211        91.559
5           0.269048      2.446        94.004
6           0.19024       1.729        95.734
7           0.17289       1.572        97.306
8           0.107148      0.974        98.280
9           0.0824071     0.749        99.029
10          0.0694689     0.632        99.660
11          0.0373497     0.340       100.000

Displayed in the table are:

Data variables: the names of the p input columns.

Data input: either observations or matrix, depending upon whether the input columns contain the original observations or the sample covariance matrix.

Number of complete cases: the number of cases n for which none of the observations were missing.

Missing value treatment: how missing values were treated in estimating the covariance or correlation matrix. If listwise, the estimates were based on complete cases only. If pairwise, all pairs of non-missing data values were used to obtain the estimates.

Standardized: yes if the analysis was based on the correlation matrix; no if it was based on the covariance matrix.

STATGRAPHICS Rev. 11/1/2013

Number of components extracted: the number of components k extracted from the data. This number is based on the settings on the Analysis Options dialog box.

A table is also displayed showing information for each of the p possible principal components:

Component number: the component number j, from 1 to p.

Eigenvalue: the eigenvalue of the estimated covariance or correlation matrix, λ̂_j.

Percentage of variance: the percent of the total estimated population variance represented by this component, equal to

100 λ̂_j / (λ̂_1 + λ̂_2 + ... + λ̂_p) %        (6)

Cumulative percentage: the cumulative percentage of the total estimated population variance accounted for by the first j components.

In the example, the first k = 2 components account for over 84% of the overall variance amongst the 11 variables.
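The percentage and cumulative-percentage columns follow directly from equation (6). A small sketch, using round illustrative eigenvalues rather than the exact values in the table, also shows the minimum-eigenvalue rule for choosing k:

```python
import numpy as np

# Illustrative eigenvalues for p = 11 standardized variables (they sum to p).
eigvals = np.array([7.9, 1.3, 0.5, 0.4, 0.3, 0.2, 0.2, 0.1, 0.05, 0.03, 0.02])

pct = 100 * eigvals / eigvals.sum()   # equation (6)
cum = np.cumsum(pct)                  # the cumulative percentage column

# Extract every component whose eigenvalue is at least the minimum (default 1.0).
k = int((eigvals >= 1.0).sum())
```

With these values, only the first two eigenvalues exceed 1, so k = 2, matching the extraction criterion described under Analysis Options.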

Analysis Options

Missing Values Treatment: method of handling missing values when estimating the sample covariances or correlations. Specify Listwise to use only cases with no missing values for any of the input variables. Specify Pairwise to use all pairs of observations in which neither value is missing.

Standardize: check this box to base the analysis on the sample correlation matrix rather than the sample covariance matrix. This corresponds to standardizing each input variable before calculating the covariances, by subtracting its mean and dividing by its standard deviation.

Extract By: the criterion used to determine the number of principal components to extract.

Minimum Eigenvalue: if extracting by magnitude of the eigenvalues, the smallest eigenvalue for which a component will be extracted.

Number of Components: if extracting by number of components, the number k.

Scree Plot

The Scree Plot can be very helpful in determining the number of principal components to extract. By default, it plots the size of the eigenvalues corresponding to each of the p possible components:

[Scree Plot: eigenvalue (vertical axis) plotted against component number 1 through 11.]

A horizontal line is added at the minimum eigenvalue specified on the Analysis Options dialog box. In the plot above, note that only the first 2 components have large eigenvalues.

Pane Options

Plot: value plotted on the vertical axis.

Component Weights

The Component Weights pane shows the estimated value of the coefficients a for each component extracted:

Table of Component Weights

               Component 1   Component 2
Engine Size      0.3376       -0.133891
Horsepower       0.26813      -0.4485
Fueltank         0.31144      -0.21014
Passengers       0.238683      0.53091
Length           0.335379      0.01
Wheelbase        0.335386      0.061033
Width            0.34896      -0.13448
U Turn Space     0.29918      -0.0830471
Rear seat        0.23156       0.53351
Luggage          0.276494      0.3776
Weight           0.337017     -0.06599

The weights within each column often have interesting interpretations. In the example, note that the weights in the first column are all approximately the same. This implies that the first component is basically an average of all of the input variables. The second component is weighted most heavily in a positive direction on the number of Passengers, the Rear seat room, and the amount of Luggage space, and in a negative direction on Horsepower. It likely differentiates amongst the different types of vehicles.
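The observation that the first column is "basically an average" has a simple limiting case: if all p variables were equally correlated, the weights of the first eigenvector would be exactly equal. A small check of that fact, using a hypothetical equicorrelation matrix rather than the 93cars data:

```python
import numpy as np

# Hypothetical equicorrelation matrix: every pair of variables has correlation rho.
p, rho = 11, 0.7
R = (1 - rho) * np.eye(p) + rho * np.ones((p, p))

eigvals, vecs = np.linalg.eigh(R)
a1 = vecs[:, np.argmax(eigvals)]
a1 = a1 * np.sign(a1[0])            # eigenvectors are only defined up to sign

# Every weight equals 1/sqrt(p): the first component is an equally weighted average.
expected = np.full(p, 1 / np.sqrt(p))
```

The largest eigenvalue in this case is 1 + (p − 1)ρ, so the more correlated the variables, the more variance that single "average" component absorbs.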

2D and 3D Component Plots

These plots display the values of 2 or 3 selected principal components for each of the n cases. It is useful to examine any points far away from the others, such as the highlighted Dodge Stealth, which has a very low value for the second component.

An interesting variation of this plot is one in which the points are coded according to another column, such as the type of vehicle:

[Plot of PCOMP_2 versus PCOMP_1, with points coded by vehicle Type: Compact, Large, Midsize, Small, Sporty.]

To produce the above plot:

1. Press the Save Results button and save the Principal Components to new columns of the datasheet.
2. Select the X-Y Plot procedure from the top menu and input the new columns.
3. Select Pane Options and specify Type in the Point Codes field.

It is now clear that the first component is related to the size of the vehicle, while the second component separates the sporty cars from the others.
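Under the hood, the saved PCOMP_ columns are simply the standardized observations projected onto the component weights, Y = Z·A. A sketch of that projection with synthetic data (the data values here are made up, and this is not the STATGRAPHICS implementation):

```python
import numpy as np

# Synthetic correlated data standing in for the automobile variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3)) @ np.array([[1.0, 0.8, 0.7],
                                         [0.0, 0.6, 0.3],
                                         [0.0, 0.0, 0.4]])

# Standardize, then project onto the eigenvectors of the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigvals, A = np.linalg.eigh(np.corrcoef(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, A = eigvals[order], A[:, order]

scores = Z @ A        # scores[:, 0] plays the role of PCOMP_1, and so on
```

As a sanity check, the sample variance of each score column reproduces the corresponding eigenvalue, in agreement with equation (4).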

Pane Options

Specify the components to plot on each axis.

Data Table

The Data Table displays the values of the principal components for each of the n cases.

Table of Principal Components

Row   Label        Component 1   Component 2
1     Integra       -1.4903       0.00673575
2     Legend         2.37408     -0.4778
3     90             0.165636    -0.61873
4     100            2.31         1.0154
5     535i           1.5815      -2.15174
6     Century        0.737        1.39817
7     LeSabre        3.46805      0.778351
8     Roadmaster     6.6603       0.133406
9     Riviera        2.4466      -1.07736

2D and 3D Component Plots

The Component Plots show the location of each variable in the space of 2 or 3 selected components:

[Plot of Component Weights: each variable positioned by its weights on Component 1 (horizontal) and Component 2 (vertical). Passengers, Rear seat, and Luggage plot high on Component 2, while Horsepower plots low.]

Variables furthest from the reference lines at 0 make the largest contribution to the components.

2D and 3D Biplots

The Biplots display both the observations and the variables on a single plot:

[Biplot: the observations plotted against Components 1 and 2, overlaid with lines from the origin for each variable (Rear seat, Passengers, Luggage, Wheelbase, Length, U Turn Space, Engine Size, Width, Weight, Fueltank, Horsepower).]

The point symbols correspond to the observations. The ends of the solid lines correspond to the variables.

Factorability Tests

Two measures of factorability are provided to help determine whether or not it is likely to be worthwhile to perform a principal components analysis on a set of variables. These measures are:

1. The Kaiser-Meyer-Olkin Measure of Sampling Adequacy.
2. Bartlett's Test of Sphericity.

The output is shown below:

Factorability Tests
Kaiser-Meyer-Olkin Measure of Sampling Adequacy
KMO = 0.9019
Bartlett's Test of Sphericity
Chi-Square = 1299.83
D.F. = 55
P-Value = 0.0

The Kaiser-Meyer-Olkin Measure of Sampling Adequacy constructs a statistic that measures how efficiently a PCA can extract factors from a set of variables. It compares the magnitudes of the correlations amongst the variables to the magnitudes of the partial correlations, determining how much common variance exists amongst the variables. The KMO statistic is calculated from

KMO = Σ_{i≠j} r_ij² / (Σ_{i≠j} r_ij² + Σ_{i≠j} a_ij²)        (7)

where r_ij = correlation between variables i and j, and a_ij = partial correlation between variables i and j accounting for all of the other variables. For a PCA to be useful, a general rule of thumb is that KMO should be greater than or equal to 0.6. In the example above, the value of 0.9 suggests that a PCA should be able to efficiently extract common factors.

Bartlett's Test of Sphericity tests the null hypothesis that the correlation matrix amongst the data variables is an identity matrix, i.e., that there are no correlations amongst any of the variables. In such a case, an attempt to extract common factors would be meaningless. The test statistic is calculated from the determinant of the correlation matrix R according to

χ² = −[n − 1 − (2p + 5)/6] ln|R|        (8)

and compared to a chi-square distribution with p(p−1)/2 degrees of freedom. A small P-Value, as in the example above, leads to a rejection of the null hypothesis and the conclusion that it makes sense to perform a PCA.
Because of the extreme sensitivity of Bartlett's test, however, it is usually ignored unless the ratio of the number of observations to the number of variables is no greater than 5.
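Both factorability measures can be sketched in a few lines of NumPy. The partial correlations a_ij in equations (7) and (9) can be obtained from the inverse of the correlation matrix; this is an illustrative implementation on synthetic data, not the STATGRAPHICS code:

```python
import numpy as np

def factorability(X):
    """Overall KMO (7), per-variable KMO (9), and Bartlett's statistic (8)."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)

    # Partial correlations from the inverse correlation matrix.
    Rinv = np.linalg.inv(R)
    d = np.sqrt(np.diag(Rinv))
    A = -Rinv / np.outer(d, d)                  # a_ij in equations (7) and (9)

    off = ~np.eye(p, dtype=bool)                # exclude the diagonal
    r2, a2 = (R ** 2)[off], (A ** 2)[off]
    kmo = r2.sum() / (r2.sum() + a2.sum())

    r2j = (R ** 2 * off).sum(axis=0)            # per-variable sums over i != j
    a2j = (A ** 2 * off).sum(axis=0)
    kmo_j = r2j / (r2j + a2j)

    # Bartlett's sphericity statistic, chi-square with p(p-1)/2 d.f.
    chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    return kmo, kmo_j, chi2, p * (p - 1) // 2

# Five correlated variables sharing a common source (illustrative only).
rng = np.random.default_rng(3)
base = rng.normal(size=(100, 1))
X = base + 0.5 * rng.normal(size=(100, 5))
kmo, kmo_j, chi2, df = factorability(X)
```

Because the columns share a common factor, the correlations dominate the partial correlations and KMO comes out well above the 0.6 rule of thumb, while the large Bartlett statistic rejects sphericity.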

Pane Options

The Pane Options dialog box specifies whether individual KMO values should be computed for each variable. If checked, a KMO statistic will be calculated for each variable j using

KMO_j = Σ_{i≠j} r_ij² / (Σ_{i≠j} r_ij² + Σ_{i≠j} a_ij²)        (9)

The sample output is shown below:

Variable       KMO
Engine Size    0.9585
Horsepower     0.869934
Fueltank       0.95566
Passengers     0.90833
Length         0.936307
Wheelbase      0.938731
Width          0.91695
U Turn Space   0.948369
Rear seat      0.84573
Luggage        0.93615
Weight         0.89631

Save Results

The following results may be saved to the datasheet:

1. Eigenvalues: the k eigenvalues.
2. Component weights: k columns, each containing p estimates of the coefficients a.
3. Principal Components: k columns, each containing n values corresponding to the extracted principal components.