Lecture 10: Factor Analysis and Principal Component Analysis. Sam Roweis. February 9, 2004

Continuous Latent Variables

In many models there are some underlying causes of the data. Mixture models use a discrete class variable: clustering. Sometimes it is more appropriate to think in terms of continuous factors which control the data we observe. Geometrically, this is equivalent to thinking of a data manifold or subspace. To generate data, first generate a point within the manifold, then add noise. The coordinates of the point are the components of the latent variable.

Factor Analysis

When we assume that the subspace is linear and that the underlying latent variable has a Gaussian distribution, we get a model known as factor analysis. With data y (p-dim) and latent variable x (k-dim):

  p(x) = \mathcal{N}(x \mid 0, I)
  p(y \mid x, \theta) = \mathcal{N}(y \mid \mu + \Lambda x, \Psi)

where μ is the mean vector, Λ is the p-by-k factor loading matrix, and Ψ is the sensor noise covariance (usually diagonal). Important: since the product of Gaussians is still Gaussian, the joint distribution p(x, y), the other marginal p(y), and the conditional p(x | y) are also Gaussian.

Marginal Data Distribution

Just as with discrete latent variables, we can compute the marginal density p(y | θ) by summing out x. But now the sum is an integral:

  p(y \mid \theta) = \int p(x)\, p(y \mid x, \theta)\, dx = \mathcal{N}(y \mid \mu, \Lambda \Lambda^\top + \Psi)

which can be done by completing the square in the exponent. However, since the marginal is Gaussian, we can also just compute its mean and covariance directly. (Assume the noise n is uncorrelated with the latent variables.)

  E[y] = E[\mu + \Lambda x + n] = \mu + \Lambda E[x] + E[n] = \mu + \Lambda \cdot 0 + 0 = \mu
  Cov[y] = E[(y - \mu)(y - \mu)^\top]
         = E[(\mu + \Lambda x + n - \mu)(\mu + \Lambda x + n - \mu)^\top]
         = E[(\Lambda x + n)(\Lambda x + n)^\top]
         = \Lambda E[x x^\top] \Lambda^\top + E[n n^\top] = \Lambda \Lambda^\top + \Psi
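As a quick numerical sanity check of the generative model and the marginal covariance derived above, here is a minimal NumPy sketch (my own, not part of the lecture); the dimensions and parameter values are arbitrary.

```python
# Sample from the FA generative model p(x) = N(0, I), p(y|x) = N(mu + Lambda x, Psi)
# and check that the sample covariance of y approaches Lambda Lambda^T + Psi.
import numpy as np

rng = np.random.default_rng(0)
p, k, N = 5, 2, 200_000                         # observed dim, latent dim, sample count

mu = rng.normal(size=p)                         # mean vector mu
Lam = rng.normal(size=(p, k))                   # p-by-k factor loading matrix Lambda
Psi = np.diag(rng.uniform(0.1, 1.0, size=p))    # diagonal sensor noise covariance Psi

x = rng.normal(size=(N, k))                     # latent variables ~ N(0, I)
noise = rng.normal(size=(N, p)) @ np.sqrt(Psi)  # sensor noise ~ N(0, Psi)
y = mu + x @ Lam.T + noise                      # observed data from the FA model

model_cov = Lam @ Lam.T + Psi                   # Lambda Lambda^T + Psi
sample_cov = np.cov(y, rowvar=False)
print(np.abs(sample_cov - model_cov).max())     # small for large N
```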

FA = Constrained Covariance Gaussian

Marginal density for factor analysis (y is p-dim, x is k-dim):

  p(y \mid \theta) = \mathcal{N}(y \mid \mu, \Lambda \Lambda^\top + \Psi)

So the effective covariance is the low-rank outer product of two long skinny matrices plus a diagonal matrix:

  Cov[y] = \Lambda \Lambda^\top + \Psi

In other words, factor analysis is just a constrained Gaussian model. (If Ψ were not diagonal then we could model any Gaussian and it would be pointless.)

Learning: how should we fit the ML parameters? It is easy to find μ: just take the mean of the data. From now on assume we have done this and re-centred y. What about the other parameters?

Likelihood Function

Since the FA data model is Gaussian, the likelihood function is simple:

  \ell(\theta; D) = -\frac{N}{2} \log|\Lambda \Lambda^\top + \Psi| - \frac{1}{2} \sum_n (y_n - \mu)^\top (\Lambda \Lambda^\top + \Psi)^{-1} (y_n - \mu)
                  = -\frac{N}{2} \log|V| - \frac{1}{2} \operatorname{trace}\Big[ V^{-1} \sum_n (y_n - \mu)(y_n - \mu)^\top \Big]
                  = -\frac{N}{2} \log|V| - \frac{N}{2} \operatorname{trace}[V^{-1} S]

Here V is the model covariance and S is the sample data covariance. In other words, we are trying to make the constrained model covariance as close as possible to the observed covariance, where "close" means the trace of the ratio. Thus the sufficient statistics are the same as for the Gaussian: the mean \sum_n y_n and the covariance \sum_n (y_n - \mu)(y_n - \mu)^\top.

EM for Factor Analysis

We will do maximum likelihood learning using (surprise, surprise) the EM algorithm.

  E-step: q^{t+1}_n = p(x_n \mid y_n, \theta^t)
  M-step: \theta^{t+1} = \arg\max_\theta \sum_n \int q^{t+1}_n(x \mid y_n) \log p(y_n, x \mid \theta)\, dx

For the E-step we need the conditional distribution (inference). For the M-step we need the expected log of the complete data.

  E-step: q^{t+1}_n = p(x_n \mid y_n, \theta^t) = \mathcal{N}(x_n \mid m_n, V_n)
  M-step: \Lambda^{t+1} = \arg\max_\Lambda \sum_n \langle \ell_c(x_n, y_n) \rangle_{q^{t+1}_n}
          \Psi^{t+1} = \arg\max_\Psi \sum_n \langle \ell_c(x_n, y_n) \rangle_{q^{t+1}_n}

From Joint Distribution to Conditional

To get the conditional p(x | y) we will start with the joint p(x, y) and apply Bayes' rule for Gaussian conditionals. Write down the joint distribution of x and y:

  p\!\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) = \mathcal{N}\!\left(\begin{bmatrix} x \\ y \end{bmatrix} \;\middle|\; \begin{bmatrix} 0 \\ \mu \end{bmatrix}, \begin{bmatrix} I & \Lambda^\top \\ \Lambda & \Lambda \Lambda^\top + \Psi \end{bmatrix}\right)

where the corner elements Λᵀ and Λ come from Cov[x, y]:

  Cov[x, y] = E[(x - 0)(y - \mu)^\top] = E[x (\mu + \Lambda x + n - \mu)^\top] = E[x (\Lambda x + n)^\top] = \Lambda^\top

Assume the noise is uncorrelated with the data or the latent variables.
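The constrained-covariance likelihood above is easy to evaluate numerically. Here is a hedged sketch (mine, not from the slides); it assumes the data have already been centred and drops the constant term in 2π, as the slide does.

```python
# Evaluate l(theta; D) = -N/2 log|Lambda Lambda^T + Psi| - N/2 trace[(Lambda Lambda^T + Psi)^{-1} S]
import numpy as np

def fa_log_likelihood(Y, Lam, Psi):
    """Log-likelihood of centred data Y (N x p) under N(0, Lambda Lambda^T + Psi),
    dropping the constant -Np/2 log(2 pi) term."""
    N, p = Y.shape
    S = Y.T @ Y / N                      # sample covariance of the centred data
    V = Lam @ Lam.T + Psi                # constrained model covariance
    _, logdet = np.linalg.slogdet(V)
    return -0.5 * N * logdet - 0.5 * N * np.trace(np.linalg.solve(V, S))
```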

E-step: Inference in Factor Analysis

Apply the Gaussian conditioning formulas to the joint distribution we derived above. This gives:

  p(x \mid y) = \mathcal{N}(x \mid m, V)
  V = I - \Lambda^\top (\Lambda \Lambda^\top + \Psi)^{-1} \Lambda
  m = \Lambda^\top (\Lambda \Lambda^\top + \Psi)^{-1} (y - \mu)

Now apply the matrix inversion lemma to get:

  p(x \mid y) = \mathcal{N}(x \mid m, V)
  V = (I + \Lambda^\top \Psi^{-1} \Lambda)^{-1}
  m = V \Lambda^\top \Psi^{-1} (y - \mu)

Inference is Linear

Note: inference just multiplies y by a matrix:

  p(x \mid y) = \mathcal{N}(x \mid m, V)
  V = I - \Lambda^\top (\Lambda \Lambda^\top + \Psi)^{-1} \Lambda = (I + \Lambda^\top \Psi^{-1} \Lambda)^{-1}
  m = \Lambda^\top (\Lambda \Lambda^\top + \Psi)^{-1} (y - \mu) = V \Lambda^\top \Psi^{-1} (y - \mu)

Note: inference of the posterior mean is just a linear operation, m = \beta (y - \mu), where β can be computed beforehand given the model parameters. Also: the posterior covariance does not depend on the observed data!

  \operatorname{cov}[x \mid y] = V = (I + \Lambda^\top \Psi^{-1} \Lambda)^{-1}

Complete Data Likelihood

We know the optimal μ is the data mean. Assume the mean has been subtracted off y from now on. The complete likelihood (ignoring the mean):

  \ell_c(\Lambda, \Psi) = \sum_n \log p(x_n, y_n) = \sum_n \log p(x_n) + \log p(y_n \mid x_n)
                        = -\frac{N}{2} \log|\Psi| - \frac{1}{2} \sum_n x_n^\top x_n - \frac{1}{2} \sum_n (y_n - \Lambda x_n)^\top \Psi^{-1} (y_n - \Lambda x_n)
  \ell_c(\Lambda, \Psi) = -\frac{N}{2} \log|\Psi| - \frac{N}{2} \operatorname{trace}[S \Psi^{-1}] + \text{const}, \qquad S = \frac{1}{N} \sum_n (y_n - \Lambda x_n)(y_n - \Lambda x_n)^\top

(The term in x_n^\top x_n does not involve Λ or Ψ and is absorbed into the constant.)

M-step: Optimize Parameters

Take the derivatives of the complete log likelihood with respect to the parameters:

  \partial \ell_c(\Lambda, \Psi) / \partial \Lambda = \Psi^{-1} \sum_n y_n x_n^\top - \Psi^{-1} \Lambda \sum_n x_n x_n^\top
  \partial \ell_c(\Lambda, \Psi) / \partial \Psi^{-1} = +\frac{N}{2} \Psi - \frac{N}{2} S

Take the expectation with respect to q^{t+1} from the E-step:

  \langle \partial \ell_c / \partial \Lambda \rangle = \Psi^{-1} \sum_n y_n m_n^\top - \Psi^{-1} \Lambda \sum_n V_n
  \langle \partial \ell_c / \partial \Psi^{-1} \rangle = +\frac{N}{2} \Psi - \frac{N}{2} \langle S \rangle

Finally, set the derivatives to zero to solve for the optimal parameters:

  \Lambda^{t+1} = \Big( \sum_n y_n m_n^\top \Big) \Big( \sum_n V_n \Big)^{-1}
  \Psi^{t+1} = \frac{1}{N} \operatorname{diag}\Big[ \sum_n y_n y_n^\top - \Lambda^{t+1} \sum_n m_n y_n^\top \Big]
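Because inference is linear and the posterior covariance is shared by all data points, the whole E-step is two small matrix computations. A minimal sketch, assuming a diagonal Ψ as in the model above (the function and variable names are my own):

```python
# E-step: V = (I + Lam^T Psi^{-1} Lam)^{-1}, m_n = V Lam^T Psi^{-1} (y_n - mu)
import numpy as np

def fa_e_step(Y, mu, Lam, Psi):
    """Posterior means M (N x k) and the shared posterior covariance V (k x k)."""
    k = Lam.shape[1]
    Psi_inv = np.diag(1.0 / np.diag(Psi))              # Psi is diagonal, so this is cheap
    V = np.linalg.inv(np.eye(k) + Lam.T @ Psi_inv @ Lam)
    beta = V @ Lam.T @ Psi_inv                         # inference is linear: m_n = beta (y_n - mu)
    M = (Y - mu) @ beta.T
    return M, V
```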

Final Algorithm: EM for Factor Analysis

First, set μ equal to the sample mean (1/N) \sum_n y_n, and subtract this mean from all the data. Now run the following iterations:

  E-step: q^{t+1}_n = p(x_n \mid y_n, \theta^t) = \mathcal{N}(x_n \mid m_n, V_n)
          V = (I + \Lambda^\top \Psi^{-1} \Lambda)^{-1}
          m_n = V \Lambda^\top \Psi^{-1} (y_n - \mu)
  M-step: \Lambda^{t+1} = \Big( \sum_n y_n m_n^\top \Big) \Big( \sum_n V_n \Big)^{-1}
          \Psi^{t+1} = \frac{1}{N} \operatorname{diag}\Big[ \sum_n y_n y_n^\top - \Lambda^{t+1} \sum_n m_n y_n^\top \Big]

Principal Component Analysis

In Factor Analysis, we can write the marginal density explicitly:

  p(y \mid \theta) = \int p(x)\, p(y \mid x, \theta)\, dx = \mathcal{N}(y \mid \mu, \Lambda \Lambda^\top + \Psi)

The noise Ψ must be restricted for the model to be interesting. (Why?) In Factor Analysis the restriction is that Ψ is diagonal (axis-aligned). What if we further restrict Ψ = σ²I (i.e. spherical)? We get the Probabilistic Principal Component Analysis (PPCA) model:

  p(x) = \mathcal{N}(x \mid 0, I)
  p(y \mid x, \theta) = \mathcal{N}(y \mid \mu + \Lambda x, \sigma^2 I)

where μ is the mean vector, the columns of Λ are the principal components (usually orthogonal), and σ² is the global sensor noise.

Likelihood Function

As with FA, the PPCA data model is Gaussian. Thus, the likelihood function is simple:

  \ell(\theta; D) = -\frac{N}{2} \log|\Lambda \Lambda^\top + \Psi| - \frac{1}{2} \sum_n (y_n - \mu)^\top (\Lambda \Lambda^\top + \Psi)^{-1} (y_n - \mu)
                  = -\frac{N}{2} \log|V| - \frac{1}{2} \operatorname{trace}\Big[ V^{-1} \sum_n (y_n - \mu)(y_n - \mu)^\top \Big]
                  = -\frac{N}{2} \log|V| - \frac{N}{2} \operatorname{trace}[V^{-1} S]

V is the model covariance; S is the sample data covariance. In other words, we are trying to make the constrained model covariance as close as possible to the observed covariance, where "close" means the trace of the ratio. Thus, the sufficient statistics are the same as for the Gaussian: the mean \sum_n y_n and the covariance \sum_n (y_n - \mu)(y_n - \mu)^\top.

Fitting the PPCA Model

The standard EM algorithm applies to PPCA also:

  E-step: q^{t+1}_n = p(x_n \mid y_n, \theta^t)
  M-step: \theta^{t+1} = \arg\max_\theta \sum_n \int q^{t+1}_n(x \mid y_n) \log p(y_n, x \mid \theta)\, dx

For this we need the conditional distribution (inference) and the expected log of the complete data. Results:

  E-step: q^{t+1}_n = p(x_n \mid y_n, \theta^t) = \mathcal{N}(x_n \mid m_n, V_n)
          V = (I + \sigma^{-2} \Lambda^\top \Lambda)^{-1}
          m_n = \sigma^{-2} V \Lambda^\top (y_n - \mu)
  M-step: \Lambda^{t+1} = \Big( \sum_n y_n m_n^\top \Big) \Big( \sum_n V_n \Big)^{-1}
          \sigma^{2,t+1} = \frac{1}{DN} \sum_i \Big[ \sum_n y_n y_n^\top - \Lambda^{t+1} \sum_n m_n y_n^\top \Big]_{ii}

(Here D is the dimensionality of y, i.e. D = p.)
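Putting the pieces together, here is a compact NumPy sketch of the EM iterations above (my own code, not from the lecture). Following the slide, \sum_n V_n is read as the summed posterior second moments \sum_n E[x_n x_n^\top \mid y_n] = N V + \sum_n m_n m_n^\top; the initialization and fixed iteration count are arbitrary choices. The same loop fits PPCA if Ψ is constrained to σ²I.

```python
import numpy as np

def fa_em(Y, k, n_iters=100, seed=0):
    """EM for factor analysis on data Y (N x p) with a k-dimensional latent space."""
    rng = np.random.default_rng(seed)
    N, p = Y.shape
    mu = Y.mean(axis=0)                         # set mu to the sample mean ...
    Yc = Y - mu                                 # ... and subtract it from all the data
    Lam = rng.normal(size=(p, k))               # arbitrary initialization (my choice)
    Psi = np.diag(Yc.var(axis=0))

    for _ in range(n_iters):
        # E-step: V = (I + Lam^T Psi^{-1} Lam)^{-1}, m_n = V Lam^T Psi^{-1} (y_n - mu)
        Psi_inv = np.diag(1.0 / np.diag(Psi))
        V = np.linalg.inv(np.eye(k) + Lam.T @ Psi_inv @ Lam)
        M = Yc @ (V @ Lam.T @ Psi_inv).T        # N x k matrix of posterior means
        Exx = N * V + M.T @ M                   # sum_n E[x_n x_n^T | y_n]

        # M-step: Lambda = (sum_n y_n m_n^T)(sum_n E[x_n x_n^T])^{-1}
        #         Psi    = (1/N) diag[sum_n y_n y_n^T - Lambda sum_n m_n y_n^T]
        YM = Yc.T @ M                           # sum_n y_n m_n^T  (p x k)
        Lam = YM @ np.linalg.inv(Exx)
        Psi = np.diag(np.diag(Yc.T @ Yc - Lam @ YM.T) / N)
    return mu, Lam, Psi
```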

PCA: Zero Noise Limit

The traditional PCA model is actually a limit as σ² → 0. The model we saw is actually called probabilistic PCA. However, the ML parameters Λ are the same. The only difference is the global sensor noise σ². In the zero noise limit, inference is easier: orthogonal projection.

  \lim_{\sigma^2 \to 0} \Lambda^\top (\Lambda \Lambda^\top + \sigma^2 I)^{-1} = (\Lambda^\top \Lambda)^{-1} \Lambda^\top

Direct Fitting

For FA the parameters are coupled in a way that makes it impossible to solve for the ML parameters directly. We must use EM or other nonlinear optimization techniques. But for (P)PCA, the ML parameters can be solved for directly: the kth column of Λ is the kth largest eigenvalue of the sample covariance S times the associated eigenvector; the global sensor noise σ² is the sum of all the eigenvalues smaller than the kth one. This technique is good for initializing FA also.

Actually PCA is the limit as the ratio of the noise variance on the output to the prior variance on the latent variables goes to zero. We can either achieve this with zero noise or with infinite-variance priors.

Scale Invariance in Factor Analysis

In FA the scale of the data is unimportant: we can multiply each y_i by α_i without changing anything:

  \mu_i \to \alpha_i \mu_i, \qquad \Lambda_{ij} \to \alpha_i \Lambda_{ij}, \qquad \Psi_i \to \alpha_i^2 \Psi_i

However, the rotation of the data is important. FA looks for directions of large correlation in the data, so it is not fooled by large-variance noise.

[Figure: FA vs. PCA fits]

Rotational Invariance in PCA

In PCA the rotation of the data is unimportant: we can multiply the data y by any rotation Q without changing anything:

  \mu \to Q\mu, \qquad \Lambda \to Q\Lambda, \qquad \Psi \text{ unchanged}

However, the scale of the data is important. PCA looks for directions of large variance, so it will chase big noise directions.

[Figure: FA vs. PCA fits]
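A sketch of the direct fit (mine, not from the slides). It follows the Tipping and Bishop closed form, which is one way to make the recipe above precise: σ² is taken as the average of the discarded eigenvalues, and each retained eigenvector is scaled by the square root of (its eigenvalue minus σ²). As noted above, the result is also a reasonable initializer for FA.

```python
import numpy as np

def ppca_direct_fit(Y, k):
    """Closed-form PPCA fit from the eigendecomposition of the sample covariance."""
    N, p = Y.shape
    mu = Y.mean(axis=0)
    S = np.cov(Y, rowvar=False)                      # sample covariance
    evals, evecs = np.linalg.eigh(S)                 # ascending eigenvalues
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]     # now sorted descending
    sigma2 = evals[k:].mean() if k < p else 0.0      # noise level from discarded eigenvalues
    Lam = evecs[:, :k] * np.sqrt(np.maximum(evals[:k] - sigma2, 0.0))
    return mu, Lam, sigma2
```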

Gaussians are Footballs in High-D

Recall the intuition that Gaussians are hyperellipsoids.
Mean == centre of the football.
Eigenvectors of the covariance matrix == axes of the football.
Eigenvalues == lengths of the axes.
In FA our football is an axis-aligned cigar. In PPCA our football is a sphere of radius σ².

[Figure: FA covariance (diagonal Ψ) vs. PPCA covariance (εI)]

Review: Gaussian Conditioning

Remember the formulas for conditional Gaussian distributions:

  p\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = \mathcal{N}\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \;\middle|\; \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}\right)
  p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid m_{1|2}, V_{1|2})
  m_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2)
  V_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}

Review: Matrix Inversion Lemma

There is a good trick for inverting matrices when they can be decomposed into the sum of an easily inverted matrix (D) and a low-rank outer product. It is called the matrix inversion lemma:

  (D - A B^{-1} A^\top)^{-1} = D^{-1} + D^{-1} A (B - A^\top D^{-1} A)^{-1} A^\top D^{-1}

Review: Matrix Derivatives

You often need these tricks to compute the M-step:

  \frac{\partial}{\partial A} \log|A| = (A^{-1})^\top
  \frac{\partial}{\partial A} \operatorname{trace}[B A^\top] = B
  \frac{\partial}{\partial A} \operatorname{trace}[B A^\top C A] = 2 C A B \quad \text{(for symmetric } B, C\text{)}
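A small numerical check of the matrix inversion lemma as stated above (my own, not from the slides): D is an easily inverted diagonal matrix and A B^{-1} A^T is a low-rank correction, which is exactly the structure of the FA posterior covariance.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k = 6, 2
D = np.diag(rng.uniform(1.0, 2.0, size=p))     # easily inverted diagonal matrix
A = 0.1 * rng.normal(size=(p, k))              # tall skinny matrix (low-rank correction)
B = np.eye(k)                                  # small k-by-k matrix

lhs = np.linalg.inv(D - A @ np.linalg.inv(B) @ A.T)
D_inv = np.diag(1.0 / np.diag(D))
rhs = D_inv + D_inv @ A @ np.linalg.inv(B - A.T @ D_inv @ A) @ A.T @ D_inv
print(np.allclose(lhs, rhs))                   # True
```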

Review: Means, Variances and Covariances

Remember the definition of the mean and covariance of a vector random variable:

  E[x] = \int x\, p(x)\, dx = m
  Cov[x] = E[(x - m)(x - m)^\top] = \int (x - m)(x - m)^\top p(x)\, dx = V

which is the expected value of the outer product of the variable with itself, after subtracting the mean.

Also, the covariance between two variables:

  Cov[x, y] = E[(x - m_x)(y - m_y)^\top] = \int\!\!\int (x - m_x)(y - m_y)^\top p(x, y)\, dx\, dy = C_{xy}

which is the expected value of the outer product of one variable with another, after subtracting their means. Note: C_{xy} is not symmetric.
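A tiny numerical illustration of these definitions (mine, with made-up data): the sample cross-covariance is generally not symmetric, and C_yx is the transpose of C_xy.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100_000
x = rng.normal(size=(N, 2))
y = x @ np.array([[1.0, 0.5, 0.0],
                  [0.0, 1.0, 2.0]]) + rng.normal(size=(N, 3))

mx, my = x.mean(axis=0), y.mean(axis=0)
C_xy = (x - mx).T @ (y - my) / N     # 2 x 3 cross-covariance: not even square here
C_yx = (y - my).T @ (x - mx) / N     # 3 x 2
print(np.allclose(C_xy, C_yx.T))     # True: C_yx = C_xy^T
```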