Dimensionality Reduction vs. Clustering


Lecture 9: Continuous Latent Variable Models
Sam Roweis
November 4, 2003

Dimensionality Reduction vs. Clustering

Training such factor models (e.g. FA, PCA, ICA) is called dimensionality reduction. You can think of this as (non)linear regression with missing inputs.

Continuous causes can sometimes be much more efficient at representing information than discrete causes. For example, if there are two factors with about 256 settings each, we can describe the latent causes with two 8-bit numbers. If we tried to cluster, we would need 2^16 ≈ 65,536 clusters.

Continuous Latent Variable Models

Often there are some unknown underlying causes of the data. Mixture models use a discrete class variable: clustering. Sometimes it is more appropriate to think in terms of continuous factors which control the data we observe. Geometrically, this is equivalent to thinking of a data manifold or subspace.

[Figure: a two-dimensional plane, centred at µ and spanned by directions λ1 and λ2, embedded in the three-dimensional data space (y1, y2, y3).]

To generate data, first generate a point within the manifold, then add noise. The coordinates of the point are the components of the latent variable.

Factor Analysis

When we assume that the subspace is linear and that the underlying latent variable has a Gaussian distribution, we get a model known as factor analysis: data y (p-dimensional); latent variable z (k-dimensional).

p(z) = N(z | 0, I)
p(y | z, θ) = N(y | µ + Λz, Ψ)

where µ is the mean vector, Λ is the p-by-k factor loading matrix, and Ψ is the sensor noise covariance (usually diagonal).

Important: since the product of Gaussians is still Gaussian, the joint distribution p(z, y), the other marginal p(y) and the conditional p(z | y) are also Gaussian.
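The generative story above is easy to simulate. Below is a minimal NumPy sketch (not from the lecture) that draws samples from the factor analysis model; the sizes p, k, N and all parameter values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: p-dimensional observations, k-dimensional latent factors.
p, k, N = 5, 2, 1000

# Illustrative parameters (not taken from the lecture).
mu = rng.normal(size=p)                 # mean vector µ
Lam = rng.normal(size=(p, k))           # factor loading matrix Λ (p by k)
psi = rng.uniform(0.1, 0.5, size=p)     # diagonal of the sensor noise covariance Ψ

# Generative process: z ~ N(0, I); pick the point µ + Λz on the linear
# subspace spanned by the columns of Λ, then add axis-aligned noise.
z = rng.normal(size=(N, k))
noise = rng.normal(size=(N, p)) * np.sqrt(psi)
y = mu + z @ Lam.T + noise              # N samples of the observed variable y
```

The columns of Λ span the linear manifold; Ψ controls how far samples scatter away from it.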

Marginal Data Distribution

Just as with discrete latent variables, we can compute the marginal density p(y | θ) by summing out z. But now the sum is an integral:

p(y | θ) = ∫ p(z) p(y | z, θ) dz = N(y | µ, ΛΛ^T + Ψ)

which can be done by completing the square in the exponent. However, since the marginal is Gaussian, we can also just compute its mean and covariance. (Assume the noise n is uncorrelated with the latent variable.)

E[y] = E[µ + Λz + n] = µ + Λ E[z] + E[n] = µ + Λ·0 + 0 = µ

Cov[y] = E[(y − µ)(y − µ)^T]
       = E[(µ + Λz + n − µ)(µ + Λz + n − µ)^T]
       = E[(Λz + n)(Λz + n)^T]
       = Λ E[zz^T] Λ^T + E[nn^T] = ΛΛ^T + Ψ

Reminder: Gaussian Conditioning

Remember the formulas for marginalizing and conditioning Gaussian probability distribution functions.

Joint:        p([x1; x2]) = N([x1; x2] | [µ1; µ2], [Σ11 Σ12; Σ21 Σ22])
Marginals:    p(x1) = N(x1 | µ1, Σ11)
Conditionals: p(x1 | x2) = N(x1 | m_{1|2}, V_{1|2})
              m_{1|2} = µ1 + Σ12 Σ22^{-1} (x2 − µ2)
              V_{1|2} = Σ11 − Σ12 Σ22^{-1} Σ21

Reminder: Means, Variances and Covariances

Remember the definition of the mean and covariance of a vector random variable:

E[x] = ∫ x p(x) dx = m_x
Cov[x] = E[(x − m)(x − m)^T] = ∫ (x − m)(x − m)^T p(x) dx = V_x

which is the expected value of the outer product of the variable with itself, after subtracting the mean. It is symmetric. Also, the (cross-)covariance between two variables:

Cov[x, y] = E[(x − m_x)(y − m_y)^T] = ∫∫ (x − m_x)(y − m_y)^T p(x, y) dx dy = C_xy

which is the expected value of the outer product of one variable with another, after subtracting their means. Note: C_xy is not symmetric.

Constrained Covariance

Marginal density for factor analysis (y is p-dim, z is k-dim):

p(y | θ) = N(y | µ, ΛΛ^T + Ψ)

So the effective covariance is the low-rank outer product of two long skinny matrices plus a diagonal matrix:

Cov[y] = Λ Λ^T + Ψ

In other words, factor analysis is just a constrained Gaussian model. (If Ψ were not diagonal then we could model any Gaussian and it would be pointless.)

It is easy to find µ: just take the mean of the data. From now on assume we have done this and re-centred y.
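As a sanity check on the marginal derived above, one can sample from the generative model and compare the empirical mean and covariance of y against µ and ΛΛ^T + Ψ. A small sketch with made-up parameters and sample sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, N = 5, 2, 200_000                 # arbitrary sizes for the check

mu = rng.normal(size=p)
Lam = rng.normal(size=(p, k))
Psi = np.diag(rng.uniform(0.1, 0.5, size=p))

# Sample from the FA generative model and compare the empirical moments of y
# against the closed-form marginal N(µ, ΛΛ^T + Ψ).
z = rng.normal(size=(N, k))
y = mu + z @ Lam.T + rng.multivariate_normal(np.zeros(p), Psi, size=N)

print(np.allclose(y.mean(axis=0), mu, atol=0.05))                      # mean ≈ µ
print(np.allclose(np.cov(y, rowvar=False), Lam @ Lam.T + Psi, atol=0.1))  # cov ≈ ΛΛ^T + Ψ
```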

EM for Factor Analysis

We will do maximum likelihood learning using (surprise, surprise) the EM algorithm.

E-step: q^{t+1}_n = p(z | y_n, θ^t)
M-step: θ^{t+1} = argmax_θ Σ_n ∫ q^{t+1}(z | y_n) log p(y_n, z | θ) dz

For this we need the conditional distribution (inference) and the expected log of the complete data. Results:

E-step: q^{t+1}_n = p(z | y_n, θ^t) = N(z | m_n, V)
        V = (I + Λ^T Ψ^{-1} Λ)^{-1}
        m_n = V Λ^T Ψ^{-1} (y_n − µ)

M-step: Λ^{t+1} = (Σ_n y_n m_n^T)(N V + Σ_n m_n m_n^T)^{-1}
        Ψ^{t+1} = (1/N) diag[Σ_n y_n y_n^T − Λ^{t+1} Σ_n m_n y_n^T]

Inference in Factor Analysis

Apply the Gaussian conditioning formulas to the joint distribution we derived above. This gives:

p(z | y) = N(z | m, V)
V = I − Λ^T (ΛΛ^T + Ψ)^{-1} Λ
m = Λ^T (ΛΛ^T + Ψ)^{-1} (y − µ)

Now apply the matrix inversion lemma to get:

p(z | y) = N(z | m, V)
V = (I + Λ^T Ψ^{-1} Λ)^{-1}
m = V Λ^T Ψ^{-1} (y − µ)

Complete Data Likelihood

Write down the joint distribution of z and y:

p([z; y]) = N( [z; y] | [0; µ], [ I      Λ^T
                                  Λ   ΛΛ^T + Ψ ] )

where the corner elements Λ^T, Λ come from Cov[z, y]:

Cov[z, y] = E[(z − 0)(y − µ)^T] = E[z (µ + Λz + n − µ)^T] = E[z (Λz + n)^T] = Λ^T

This gives the complete data log likelihood (ignoring the mean):

ℓ_c(Λ, Ψ) = −(N/2) log|Ψ| − (1/2) Σ_n z_n^T z_n − (1/2) Σ_n (y_n − Λz_n)^T Ψ^{-1} (y_n − Λz_n)
          = −(N/2) log|Ψ| − (N/2) trace[S Ψ^{-1}]   (dropping terms that do not involve Λ or Ψ)

S = (1/N) Σ_n (y_n − Λz_n)(y_n − Λz_n)^T

Matrix Inversion Lemma

There is a good trick for inverting matrices when they can be decomposed into the sum of an easily inverted matrix (D) and a low-rank outer product. It is called the matrix inversion lemma:

(D − A B^{-1} A^T)^{-1} = D^{-1} + D^{-1} A (B − A^T D^{-1} A)^{-1} A^T D^{-1}
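Putting the E-step and M-step together gives a compact EM loop. The sketch below is one possible NumPy implementation of the updates as reconstructed above; the function name, initialization scheme, and variable names are my own, not the lecture's.

```python
import numpy as np

def fa_em(Y, k, n_iter=100, seed=0):
    """A sketch of EM for factor analysis (not Roweis' code).

    Y: (N, p) data matrix; k: latent dimension.
    Returns the loading matrix Lam (p, k) and the diagonal of the noise psi (p,).
    """
    rng = np.random.default_rng(seed)
    N, p = Y.shape
    mu = Y.mean(axis=0)
    Yc = Y - mu                                  # re-centre the data
    Lam = rng.normal(scale=0.1, size=(p, k))     # small random initial loadings
    psi = Yc.var(axis=0)                         # initial diagonal Ψ

    for _ in range(n_iter):
        # E-step: q(z|y_n) = N(m_n, V) with
        # V = (I + Λ^T Ψ^{-1} Λ)^{-1} and m_n = V Λ^T Ψ^{-1} (y_n − µ).
        LtPinv = Lam.T / psi                     # Λ^T Ψ^{-1}
        V = np.linalg.inv(np.eye(k) + LtPinv @ Lam)
        M = Yc @ LtPinv.T @ V                    # (N, k): row n holds m_n^T

        # M-step: Λ = (Σ_n y_n m_n^T)(N V + Σ_n m_n m_n^T)^{-1}
        #         Ψ = (1/N) diag[Σ_n y_n y_n^T − Λ Σ_n m_n y_n^T]
        Ezz = N * V + M.T @ M                    # Σ_n E[z_n z_n^T]
        Lam = (Yc.T @ M) @ np.linalg.inv(Ezz)
        psi = np.diag(Yc.T @ Yc - Lam @ (M.T @ Yc)) / N
    return Lam, psi
```

Running it on the samples drawn earlier should recover a loading matrix whose outer product Lam @ Lam.T plus diag(psi) is close to the true ΛΛ^T + Ψ (Λ itself is only identified up to rotation, as discussed below).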

Derivatives

You need these tricks to compute the M-step derivatives:

∂/∂A log|A| = A^{-T}
∂/∂A trace[B A] = B^T
∂/∂A trace[B A^T C A] = 2 C A B   (for symmetric B, C)

Gaussians are Footballs in High-D

Recall the intuition that Gaussians are hyperellipsoids.

Mean == centre of the football
Eigenvectors of the covariance matrix == axes of the football
Eigenvalues == lengths of the axes

In FA our noise football is an axis-aligned cigar (diagonal Ψ). In PCA our noise football is a sphere (covariance σ²I).

[Figure: the FA noise ellipsoid Ψ (axis-aligned) next to the PCA noise sphere σ²I.]

Principal Component Analysis

In Factor Analysis, we can write the marginal density explicitly:

p(y | θ) = ∫ p(z) p(y | z, θ) dz = N(y | µ, ΛΛ^T + Ψ)

The noise Ψ must be restricted for the model to be interesting. (Why?) In Factor Analysis the restriction is that Ψ is diagonal (axis-aligned).

What if we further restrict Ψ = σ²I (i.e. spherical)? We get the Principal Component Analysis (PCA) model:

p(z) = N(z | 0, I)
p(y | z, θ) = N(y | µ + Λz, σ²I)

where µ is the mean vector, the columns of Λ are the principal components (usually orthogonal), and σ² is the global sensor noise.

Likelihood Functions

For both FA and PCA, the data model is Gaussian. Thus, the likelihood function is simple:

ℓ(θ; D) = −(N/2) log|ΛΛ^T + Ψ| − (1/2) Σ_n (y_n − µ)^T (ΛΛ^T + Ψ)^{-1} (y_n − µ)
        = −(N/2) log|V| − (1/2) trace[V^{-1} Σ_n (y_n − µ)(y_n − µ)^T]
        = −(N/2) log|V| − (N/2) trace[V^{-1} S]

where V is the model covariance and S is the sample data covariance. In other words, we are trying to make the constrained model covariance as close as possible to the observed covariance, where "close" means the trace of the ratio.

Thus, the sufficient statistics are the same as for a Gaussian: the mean Σ_n y_n and the covariance Σ_n (y_n − µ)(y_n − µ)^T.
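The last form of the likelihood, ℓ = −(N/2)(log|V| + trace[V^{-1}S]), is easy to evaluate for any constrained covariance. A short sketch (assumed shapes and names, not lecture code):

```python
import numpy as np

def gaussian_fit_loglik(S, N, Lam, noise_diag):
    """Log-likelihood (up to the additive constant -Np/2 log 2π) of a re-centred
    Gaussian with constrained covariance V = ΛΛ^T + Ψ against sample covariance S.

    Implements ℓ = -N/2 (log|V| + trace[V^{-1} S]).
    S: (p, p) sample covariance, Lam: (p, k) loadings, noise_diag: diagonal of Ψ (p,).
    """
    V = Lam @ Lam.T + np.diag(noise_diag)
    _, logdet = np.linalg.slogdet(V)
    return -0.5 * N * (logdet + np.trace(np.linalg.solve(V, S)))
```

For FA, noise_diag is a free p-vector; for PCA it is σ² times a vector of ones, so the same function scores both models.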

Fitting the PCA Model

The standard EM algorithm applies to PCA also:

E-step: q^{t+1}_n = p(z | y_n, θ^t)
M-step: θ^{t+1} = argmax_θ Σ_n ∫ q^{t+1}(z | y_n) log p(y_n, z | θ) dz

For this we need the conditional distribution (inference) and the expected log of the complete data. Results:

E-step: q^{t+1}_n = p(z | y_n, θ^t) = N(z | m_n, V)
        V = (I + σ^{-2} Λ^T Λ)^{-1}
        m_n = σ^{-2} V Λ^T (y_n − µ)

M-step: Λ^{t+1} = (Σ_n y_n m_n^T)(N V + Σ_n m_n m_n^T)^{-1}
        σ^{2,t+1} = (1/(N p)) Σ_i [Σ_n y_n y_n^T − Λ^{t+1} Σ_n m_n y_n^T]_{ii}

Inference is Linear

Recall the inference formulas for FA:

p(z | y) = N(z | m, V)
V = I − Λ^T (ΛΛ^T + Ψ)^{-1} Λ = (I + Λ^T Ψ^{-1} Λ)^{-1}
m = Λ^T (ΛΛ^T + Ψ)^{-1} (y − µ) = V Λ^T Ψ^{-1} (y − µ)

Note: inference of the posterior mean is just a linear operation!

m = β (y − µ)

where β can be computed beforehand given the model parameters. Also: the posterior covariance does not depend on the observed data!

Cov[z | y] = V = (I + Λ^T Ψ^{-1} Λ)^{-1}

Direct Fitting

For FA the parameters are coupled in a way that makes it impossible to solve for the ML parameters directly. We must use EM or other nonlinear optimization techniques.

But for PCA, the ML parameters can be solved for directly: the k-th column of Λ is determined by the k-th largest eigenvalue of the sample covariance S and its associated eigenvector, and the global sensor noise σ² is the average of the eigenvalues smaller than the k-th one. This technique is also good for initializing FA.

We can't leave the sensor noise unconstrained, or else we would always get a perfect fit!

Zero Noise Limit

The traditional PCA model is actually a limit as σ² → 0. The model we saw is actually called probabilistic PCA. However, the ML parameters Λ are the same; the only difference is the global sensor noise σ².

In the zero-noise limit, inference is easier: it becomes orthogonal projection.

lim_{σ² → 0} Λ^T (ΛΛ^T + σ²I)^{-1} = (Λ^T Λ)^{-1} Λ^T

[Figure: in the zero-noise limit, a data point y is projected orthogonally onto the plane spanned by the columns of Λ, centred at µ.]
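The zero-noise claim is easy to check numerically: as σ² shrinks, the posterior-mean operator Λ^T(ΛΛ^T + σ²I)^{-1} approaches the orthogonal projection (Λ^TΛ)^{-1}Λ^T. A quick sketch with an arbitrary Λ (all values made up):

```python
import numpy as np

rng = np.random.default_rng(2)
p, k = 6, 2                                   # made-up dimensions
Lam = rng.normal(size=(p, k))                 # arbitrary loading matrix Λ

# Orthogonal projection onto the column space of Λ: (Λ^TΛ)^{-1}Λ^T.
proj = np.linalg.solve(Lam.T @ Lam, Lam.T)

# Posterior-mean operator β = Λ^T (ΛΛ^T + σ²I)^{-1} for shrinking noise levels.
for sigma2 in [1.0, 1e-2, 1e-4, 1e-8]:
    beta = Lam.T @ np.linalg.inv(Lam @ Lam.T + sigma2 * np.eye(p))
    print(sigma2, np.max(np.abs(beta - proj)))   # difference shrinks with σ²
```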

Scale Invariance in Factor Analysis

In FA the scale of the data is unimportant: we can multiply each y_i by α_i without changing anything:

µ_i → α_i µ_i
Λ_ij → α_i Λ_ij    for all j
Ψ_i → α_i² Ψ_i

However, the rotation of the data is important. FA looks for directions of large correlation in the data, so it is not fooled by large-variance noise.

FA Model Invariance and Identifiability

There is degeneracy in the FA model. Since Λ only appears as the outer product ΛΛ^T, the model is invariant to rotations and axis flips of the latent space. We can replace Λ with ΛQ for any unitary matrix Q and the model remains the same: (ΛQ)(ΛQ)^T = Λ(QQ^T)Λ^T = ΛΛ^T.

This means that there is no one best setting of the parameters. An infinite number of parameter settings all achieve the ML score! Such models are called un-identifiable, since two people both fitting ML parameters to identical data will not be guaranteed to identify the same parameters.

Rotational Invariance in PCA

In PCA the rotation of the data is unimportant: we can multiply the data y by any rotation Q without changing anything:

µ → Qµ
Λ → QΛ
Ψ unchanged

However, the scale of the data is important. PCA looks for directions of large variance, so it will chase big noise directions.

Latent Covariance in Factor Analysis and PCA

What if we allow the latent variable z to have a covariance matrix of its own: p(z) = N(z | 0, P)? We can still compute the marginal probability:

p(y | θ) = ∫ p(z) p(y | z, θ) dz = N(y | µ, ΛPΛ^T + Ψ)

We can always absorb P into the loading matrix Λ by diagonalizing it: P = EDE^T and setting Λ' = ΛED^{1/2}. Thus, there is another degeneracy in FA, between P and Λ: we can set P to be the identity, to be diagonal, whatever we want. Traditionally we break this degeneracy by either:

setting the covariance P of the latent variable to be I (FA), or
forcing the columns of Λ to be orthonormal (PCA).
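Both degeneracies can be verified numerically: right-multiplying Λ by an orthogonal Q leaves ΛΛ^T unchanged, and a latent covariance P can be folded into Λ. A sketch with arbitrary matrices (not lecture code):

```python
import numpy as np

rng = np.random.default_rng(3)
p, k = 5, 3                                    # made-up dimensions
Lam = rng.normal(size=(p, k))

# Any orthogonal Q applied on the right leaves the model covariance ΛΛ^T unchanged.
Q, _ = np.linalg.qr(rng.normal(size=(k, k)))   # random orthogonal k x k matrix
print(np.allclose((Lam @ Q) @ (Lam @ Q).T, Lam @ Lam.T))       # True

# A latent covariance P can be absorbed into Λ: with P = E D E^T,
# Λ' = Λ E D^{1/2} satisfies Λ' Λ'^T = Λ P Λ^T.
A = rng.normal(size=(k, k))
P = A @ A.T + k * np.eye(k)                    # arbitrary SPD latent covariance
d, E = np.linalg.eigh(P)
Lam_prime = Lam @ E @ np.diag(np.sqrt(d))
print(np.allclose(Lam_prime @ Lam_prime.T, Lam @ P @ Lam.T))   # True
```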

Mixtures of Dimensionality Reducers

What's the next logical step? Try a model that has two kinds of latent variables: one discrete cluster indicator, and one vector of continuous causes. Such models simultaneously do clustering and, within each cluster, dimensionality reduction. Great idea!

Independent Components Analysis (ICA)

ICA is another continuous latent variable model, like FA, but it has a non-Gaussian and factorized prior on the latent variables.

This is good in situations where most of the factors are very small most of the time and they do not interact with each other. Example: mixtures of speech signals.

The learning problem is the same: find the weights from the factors to the outputs and infer the unknown factor values. In the case of ICA the factors are sometimes called sources, and the learning is sometimes called unmixing.

Mixtures of Factor Analyzers

The simplest version of this is the mixture of factor analyzers:

p(z) = N(z | 0, I)
p(k) = α_k
p(y | z, k, θ) = N(y | µ_k + Λ_k z, Ψ)
p(y | θ) = Σ_k ∫ p(k) p(z) p(y | z, k, θ) dz = Σ_k α_k N(y | µ_k, Λ_k Λ_k^T + Ψ)

which is a constrained mixture of Gaussians. This is like a mixture of linear experts, using a logistic regression gate, with missing inputs.

Fitting procedure? EM, of course! See ftp.cs.toronto.edu/pub/zoubin/tr-96-1.ps.gz

Geometric Intuition

Since the latent variables are assumed to be independent, we are trying to find a linear transformation of the data that recovers these independent causes. Often we use heavy-tailed source priors, e.g. p(z_i) ∝ 1/cosh(z_i). Geometric intuition: finding spikes in histograms.

[Figure: a scatter plot of two-dimensional data (x1, x2) with the learned basis vectors overlaid.]
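To make the mixture-of-factor-analyzers marginal above concrete, here is a sketch of the per-point cluster responsibilities p(k | y_n) for an already-fitted model: each component's Gaussian marginal N(µ_k, Λ_kΛ_k^T + Ψ) is evaluated and normalized. The argument names and the shared-Ψ assumption are illustrative, and this is only the cluster part of the E-step, not the full EM from the cited report.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mfa_responsibilities(Y, alphas, mus, Lams, Psi):
    """Posterior cluster responsibilities under a mixture of factor analyzers.

    Y: (N, p) data; alphas[k]: mixing weights; mus[k]: component means;
    Lams[k]: (p, d) loadings; Psi: shared diagonal noise covariance (p, p).
    Returns an (N, K) matrix whose rows sum to 1.
    """
    K = len(alphas)
    # Unnormalized log responsibilities: log α_k + log N(y | µ_k, Λ_kΛ_k^T + Ψ)
    log_r = np.stack([
        np.log(alphas[k]) + multivariate_normal.logpdf(Y, mus[k], Lams[k] @ Lams[k].T + Psi)
        for k in range(K)
    ], axis=1)
    log_r -= log_r.max(axis=1, keepdims=True)   # stabilize before exponentiating
    R = np.exp(log_r)
    return R / R.sum(axis=1, keepdims=True)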

ICA Model

The simplest form of ICA has as many outputs as sources (square) and no sensor noise on the outputs:

p(z) = Π_k p(z_k)
y = Vz

Learning in this case can be done with gradient descent (plus some covariant tricks to make the updates faster and more stable).

If you keep the square V and use isotropic Gaussian noise on the outputs, there is a simple EM algorithm, derived by Max Welling and Markus Weber.

Much more complex cases have been studied also: non-square, convolutional, time delays in mixing, etc. But for that, we need to know about time series...
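For the square, noiseless case, one common form of the gradient-descent recipe with a "covariant trick" is the natural-gradient update W ← W + η(I − E[tanh(z)z^T])W for the unmixing matrix W = V^{-1}, using the heavy-tailed 1/cosh source prior mentioned earlier. A sketch under those assumptions (variable names and step sizes are my own, not the lecture's):

```python
import numpy as np

def ica_natural_gradient(Y, n_iter=500, lr=0.01, seed=0):
    """Square, noiseless ICA by maximum likelihood with source prior
    p(z_i) ∝ 1/cosh(z_i), so that -d/dz log p(z) = tanh(z).

    Uses the natural-gradient ("covariant") update
        W <- W + lr * (I - E[tanh(z) z^T]) W,   where z = W y.
    Y: (N, d) observed mixtures; returns the unmixing matrix W (d x d).
    """
    rng = np.random.default_rng(seed)
    N, d = Y.shape
    W = np.eye(d) + 0.01 * rng.normal(size=(d, d))   # start near the identity
    for _ in range(n_iter):
        Z = Y @ W.T                                  # current source estimates z = W y
        grad = np.eye(d) - (np.tanh(Z).T @ Z) / N    # I - E[tanh(z) z^T]
        W += lr * grad @ W                           # natural-gradient step
    return W
```

At a stationary point E[tanh(z)z^T] = I, i.e. the recovered sources are decorrelated under the assumed nonlinearity, which is the unmixing condition this update is driving toward.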