Factor Analysis

Robert L. Wolpert
Department of Statistical Science
Duke University, Durham, NC, USA

1 Factor Models

The multivariate regression model Y = XB + U expresses each row Y_i ∈ R^p as a linear combination X_i B of the q columns of X, plus some mean-zero random error U_i. Perhaps fewer than q terms would suffice to predict Y adequately or, more realistically, perhaps some number m < q of linear combinations of the columns of X would suffice. That is the essence of Factor Analysis, a method proposed in 1904 by the psychologist Charles Spearman in his effort to show that all measures of mental ability (mathematical skill, vocabulary, other verbal skills, artistic skills, logical reasoning ability, etc.) could be explained by a single quantity he called g, for "general intelligence" (also the idea behind IQ).

The same goal comes up in many application areas: to explain some rather large number q of measurements in terms of a much smaller number m. An extreme example of recent importance is the search for genes within DNA sequence data, where q might be many thousands, much larger than the number n of subjects; this is a challenge for the traditional multivariate tools, and it has led to the development of data-mining methodology. The simplest possibility would be for most of the variability of X_i to be explained by a linear relation to a small number m of factors, an idea we now pursue.

For a nice introduction from a social scientist's perspective, see Richard Darlington's Cornell notes at http://www.psych.cornell.edu/Darlington/factor.htm. FA was also memorably misused in the design of Duke's abysmal Course Evaluation Forms (more on this in Section 3).

1.1 A Single Vector

Let X ∈ R^p be a random vector with mean µ ∈ R^p and covariance Σ ∈ S_p^+, and let m ∈ N. X is said to follow an m-factor model if we can write

    X = Λf + µ + u                                                    (1)

for some constant (p × m) matrix Λ (the "loading matrix") and random vectors f ∈ R^m (the "factors") and u ∈ R^p. The elements of f are called common factors, while those of u are (specific or) unique factors or errors. Of course this will get more interesting below when we have more than one vector X. Without any loss of generality we may insist that:

    E[f] = 0     Cov[f]    = E[ff'] = I
    E[u] = 0     Cov[u]    = E[uu'] = Ψ = diag(ψ_11, ..., ψ_pp)
                 Cov[f, u] = E[fu'] = 0

It follows that the covariance Σ = E[(X − µ)(X − µ)'] may be expressed as

    Σ = ΛΛ' + Ψ,                                                      (2)

the sum of a "common" component ΛΛ' and a "unique" component Ψ.

Even if an m-factor model holds for X, it is not unique: for any (m × m) orthogonal matrix G we can re-write Equation (1) as

    X = (ΛG')(Gf) + u + µ

with common factor Gf (also with mean E[Gf] = 0 and covariance E[Gff'G'] = I). For fixed Ψ this is the only indeterminacy, i.e., once Ψ is fixed then Λ is determined up to an orthogonal rotation. The indeterminacy can be resolved (if desired) by imposing some arbitrary m(m − 1)/2-dimensional linear constraint, like insisting that Λ'Ψ^{-1}Λ be diagonal or filling Λ with as many zeros as possible.
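As a quick numerical illustration of this rotation indeterminacy, here is a small R sketch (the values of Λ, Ψ, and G below are made up for illustration, not taken from these notes) checking that Λ and ΛG' imply exactly the same covariance ΛΛ' + Ψ:

    # Rotation indeterminacy: Lambda and Lambda %*% t(G) give the same Sigma.
    # All numerical values here are illustrative only.
    p <- 4; m <- 2
    Lambda <- matrix(c(0.9, 0.1,
                       0.8, 0.3,
                       0.2, 0.7,
                       0.1, 0.9), nrow = p, byrow = TRUE)
    Psi <- diag(c(0.18, 0.27, 0.45, 0.18))
    theta <- pi / 6                              # any (m x m) orthogonal G will do
    G <- matrix(c(cos(theta), -sin(theta),
                  sin(theta),  cos(theta)), m, m, byrow = TRUE)
    Sigma1 <- Lambda %*% t(Lambda) + Psi
    Sigma2 <- (Lambda %*% t(G)) %*% t(Lambda %*% t(G)) + Psi
    max(abs(Sigma1 - Sigma2))                    # ~ 1e-16: identical up to rounding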

Does every covariance satisfy Equation (2), or is there anything special about factor models? In general the dimension of the set of possible covariance matrices Σ ∈ S_p^+ is p(p + 1)/2 (for example, that is the number of non-zero entries in the Cholesky decomposition of a positive-definite matrix), while a Σ satisfying Equation (2) is determined by the pm elements of Λ and the p diagonal elements of Ψ. Insisting that the (m × m) symmetric matrix Λ'Ψ^{-1}Λ be diagonal introduces an additional m(m − 1)/2 constraints, so Σ lies in a set whose dimension is

    s = p(p + 1)/2 − { pm + p − m(m − 1)/2 } = { (p − m)² − (p + m) } / 2        (3)

smaller than that of S_p^+ (if, as usual, s > 0), making the m-factor model a genuine restriction. Note that s > 0 whenever m < p + 1/2 − √(2p + 1/4).

1.2 Fitting the Model

If s < 0 in Equation (3) above then there are infinitely many solutions (Λ, Ψ) to Equation (2), and the factor model isn't well-defined. For s = 0 there is (typically) a unique solution; for s > 0 there will typically be no exact solutions, but we can try for an approximate fit.

Let's now suppose we have n iid random vectors X_i, each satisfying Equation (1), assembled as the rows of an (n × p) matrix X. Since µ is of little interest here we estimate it by µ̂ = x̄ = (1/n) X'1, and as usual we estimate Σ by its MLE (for a normal model)

    Σ̂ = (1/n) (X − 1x̄')'(X − 1x̄');                                             (4)

now the goal is to find m, a (p × m) matrix Λ̂, and a diagonal matrix Ψ̂ ∈ S_p^+ that satisfy

    Σ̂ ≈ Λ̂Λ̂' + Ψ̂.                                                              (5)

For any proposed Λ̂ we can satisfy Equation (5) exactly on the diagonal by setting each

    ψ̂_jj := σ̂_jj − (Λ̂Λ̂')_jj = σ̂_jj − Σ_l λ̂²_jl                                (6)

(assuming that is non-negative), so we now turn to finding m and Λ̂ ∈ R^{p×m}. The usual practice is to find the smallest value of m that seems tenable for a given data set, and then to estimate Λ̂ (and hence Ψ̂) for that m.

Since the factor model of Equation (1) is invariant under changes of the location or scale of X_i, we now simplify the presentation by assuming that x̄ = 0 and σ̂_jj = (1/n) Σ_i x²_ij = 1 (just replace X with (I − (1/n)11')X (diag Σ̂)^{-1/2}), so that R = Σ̂ is the sample correlation matrix with diagonal entries r_jj = 1, and Equation (6) becomes ψ̂_jj = 1 − Σ_l λ̂²_jl.

Here is one way to estimate Ψ and Λ, called Principal Factor Analysis. Standardize as above so that the columns of X have sample mean zero and sample variance one, and let R be the sample correlation matrix. Pick any point {ψ⁰_jj} ∈ R_+^p as an initial estimate (a common choice is ψ⁰_jj = 1 − max_{k≠j} |r_jk|) and construct Ψ⁰ := diag(ψ⁰_jj). The matrix R − Ψ⁰ may not be positive-definite, but it is at least symmetric, so it has real eigenvalues λ₁ ≥ λ₂ ≥ ··· ≥ λ_p, some of which will be positive if the {ψ⁰_jj} aren't too big. Let m be the number of positive eigenvalues and let Λ̂ be the (p × m) matrix whose m columns are the associated eigenvectors, each scaled by the square root of its eigenvalue. Then Σ̂ = R will satisfy Equation (5). Now set ψ̂_jj = [R − Λ̂Λ̂']_jj to get the diagonal elements exactly.
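The following R sketch implements this principal-factor recipe. It is a minimal, single-pass illustration of the steps just described; the function name principal_factor and its structure are my own, not taken from these notes or from any package:

    # Principal Factor Analysis, following the recipe above.
    # X is an (n x p) data matrix; this is a bare-bones sketch, not production code.
    principal_factor <- function(X) {
      p <- ncol(X)
      R <- cor(X)                                   # sample correlation matrix
      psi0 <- 1 - apply(abs(R - diag(p)), 1, max)   # psi0_jj = 1 - max_{k != j} |r_jk|
      eig <- eigen(R - diag(psi0), symmetric = TRUE)
      m <- sum(eig$values > 0)                      # number of positive eigenvalues
      # Columns of Lambda-hat: eigenvectors scaled by square roots of eigenvalues
      Lambda <- eig$vectors[, 1:m, drop = FALSE] %*% diag(sqrt(eig$values[1:m]), m)
      psi <- diag(R - Lambda %*% t(Lambda))         # then match the diagonal exactly
      list(loadings = Lambda, uniquenesses = psi, m = m)
    }

The returned loadings and uniquenesses satisfy Equation (5) with equality on the diagonal, as described above.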

Very efficient algorithms implementing this approach are built into standard statistical software packages; for example, factanal() in R. As an alternative approach to estimating Λ̂ and Ψ̂, we may assume a multivariate Gaussian model and maximize the log likelihood over all choices of (Λ, Ψ), or we can construct a joint prior distribution on them and evaluate the posterior mean using MCMC.

2 Example

Psychologists often use Factor Analysis in an effort to explain the variation in a relatively high-dimensional (q) data set by a latent, relatively low-dimensional (m) trait, like intelligence or verbal and quantitative abilities. Here (taken from Mardia et al. (1979, Ch. 9), who excerpted it from Spearman (1904, p. 275), the paper in which Factor Analysis was introduced) is the sample correlation matrix for test scores of children on p = 3 verbal topics (Classics, English, and French):

        [ 1     0.83  0.78 ]
    R = [ 0.83  1     0.67 ]
        [ 0.78  0.67  1    ]

Note that m = p + 1/2 − √(2p + 1/4) = 1 factor leads to s = 0 for p = 3, and hence to a unique FA model; upon solving for Λ and Ψ we find

        [ 0.983 ]                             [ 0.034  0      0     ]
    R = [ 0.844 ] [ 0.983  0.844  0.794 ]  +  [ 0      0.287  0     ]
        [ 0.794 ]                             [ 0      0      0.370 ]

exactly. One (perhaps Spearman) might then argue that the abilities X = [X_1, X_2, X_3]' in Classics, English, and French all reflect a single "verbal skill" factor f ~ No(0, 1) through the relation

        [ 0.983 ]       [ U_1 ]              U_1 ~ No(0, 0.034)
    X = [ 0.844 ] f  +  [ U_2 ],     with    U_2 ~ No(0, 0.287)
        [ 0.794 ]       [ U_3 ]              U_3 ~ No(0, 0.370)
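This decomposition is easy to check numerically. The short R sketch below is my own illustration (not part of the original notes); it uses the closed-form one-factor solution for p = 3, λ_1 = √(r_12 r_13 / r_23) and its permutations, and confirms that R = λλ' + Ψ holds exactly:

    # Spearman (1904) verbal-scores example: p = 3, one factor (so s = 0).
    R <- matrix(c(1.00, 0.83, 0.78,
                  0.83, 1.00, 0.67,
                  0.78, 0.67, 1.00), 3, 3, byrow = TRUE)
    # Closed-form one-factor loadings for p = 3: lambda_1 = sqrt(r12*r13/r23), etc.
    lambda <- c(sqrt(R[1,2] * R[1,3] / R[2,3]),
                sqrt(R[1,2] * R[2,3] / R[1,3]),
                sqrt(R[1,3] * R[2,3] / R[1,2]))
    psi <- 1 - lambda^2
    round(lambda, 3)    # close to (0.983, 0.844, 0.794), the loadings quoted above
    round(psi, 3)       # close to (0.034, 0.287, 0.370), the uniquenesses above
    max(abs(R - (lambda %*% t(lambda) + diag(psi))))   # ~ 1e-16: an exact fit
    # For comparison, factanal(covmat = R, factors = 1) should recover essentially
    # the same loadings (up to sign), since an exact fit exists here.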

3 Interpretation and Dangers

The underlying idea of FA is that the information in a data set of high dimension p may be summarized adequately by a much smaller number m of quantities linearly related to the original data. For example, subjects' success in answering a large number p of questions might reflect their differences on quite a small number of variables. Spearman's original hypothesis was that human mental acuity could be explained by a single variable, general intelligence g; current SAT tests report a three-dimensional score (it used to be just two-dimensional).

In a comic misuse of FA, the questions on early versions of Duke's Course Evaluation forms were selected by eliminating questions with low loadings on the first factor, on the assumption that course quality is a one-dimensional quantity (with more reflection one might expect some students to prefer more quantitative courses, others more verbal ones; some might prefer a more intense experience, others a broader and lighter one; some might prefer an entertaining experience while others might prefer a penetrating one). Instead of seeking questions that would help describe these various features of classes, to help students make informed choices, all questions were removed from draft versions of the Course Evaluation Forms except those with a high loading on the first factor, which the administrator managing the process assumed would be "overall quality." Sigh.

Like Principal Components, Factor Analysis is often applied as a preparatory step to reduce the dimension of a problem before some other analysis (usually multivariate regression) is applied. If (as usual) the same data set is used both to determine the factors and to fit the subsequent regression, then this approach introduces a bias: confidence intervals will be a bit too short and will cover the true parameter values with less than the nominal probability (say, 95%). Probably the best alternative is Bayesian Model Averaging (BMA), which has its own advantages and disadvantages.

References

Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979), Multivariate Analysis, New York, NY: Academic Press.

Spearman, C. (1904), "'General Intelligence,' objectively determined and measured," American Journal of Psychology, 15, 201-292.