STATS 306B: Unsupervised Learning                                   Spring 2014
Lecture 12: May 7
Lecturer: Lester Mackey                 Scribe: Lan Huong, Snigdha Panigrahi

12.1 Beyond Linear State Space Modeling

Last lecture we completed our discussion of linear Gaussian state space models. While these techniques are broadly applicable, they are not appropriate in all settings. Indeed, in many settings, non-linear / non-Gaussian transition and emission models are much more appropriate. For example, in weather forecasting we are interested in modeling specific weather outcomes at future timepoints, but our model of the dynamics is decidedly nonlinear, being determined by complex geophysical simulations. In pose estimation we are interested in tracking the precise pose of different body parts over time, but our observations (images or frames of a video) are complex functions of the 3D poses, nearby objects, lighting conditions, and camera calibration.

Unfortunately, in non-linear settings our filtering, prediction, and other inferential conditional distributions are typically quite complex and cannot be computed in closed form. To tackle this issue, a variety of approximate nonlinear filters have been developed for approximate inference in these models, including histogram filters, extended and unscented Kalman filters, and particle filters. We will not have time to discuss these approximate filters in detail, but a brief summary of their properties can be found in the slides.

12.2 Independent Component Analysis

We now turn our attention to a new latent feature modeling approach, Independent Component Analysis (ICA), which, as usual, seeks to find the latent components and loadings that underlie our data but which employs a very different objective in selecting those loadings and components. In the terminology of the source separation community, ICA describes a class of related methods designed to separate a linearly mixed multivariate signal into additive non-Gaussian source signals (i.e., components), where the source signals are encouraged to be independent of one another.

12.2.1 Example ICA applications

Please see the accompanying slides.

The cocktail party problem

In the cocktail party problem, p microphones are positioned at different locations in a room; each microphone records a different mixture of the audio signals emanating from the mouths of the party guests (i.e., the different conversations). Our goal is to recover the individual conversations from these mixtures (which often sound like incomprehensible hums). ICA allows us to retrieve p independent sources of sound (the p voices) from the recordings of the p microphones placed around the room.

Natural image representation

In this example, each square patch of an image is considered to be a datapoint (with pixels as the vector coordinates). The collection of these patch vectors forms an image database. Our goal is to find a more natural basis for these images than the input pixel basis. ICA provides such a basis in the columns of the mixing matrix A. The slide shows the learned basis vectors; they capture visual structures like edges and contrast that are known stimuli of the primary visual cortex in humans.

12.2.2 Model formulation

In ICA, the datapoints $x_1, x_2, \ldots, x_n \in \mathbb{R}^p$ (e.g., $x_i$ may be the audio signal recorded by each of the $p$ microphones at time $i$) are viewed as independent realizations of the simple probabilistic model
\[
x = As,
\]
which satisfies the following assumptions:

1. $s$ is a latent random vector in $\mathbb{R}^p$ with zero mean, $E[s] = 0$, independent components, and identity covariance, $\mathrm{Cov}(s) = I$.

2. The parameter $A \in \mathbb{R}^{p \times p}$ is an unknown mixing matrix.

Hence, $x$ is a random vector in $\mathbb{R}^p$ with $\mathrm{Cov}(x) = AA^T$. In our usual generative model notation, we could equivalently write
\[
s_i \overset{iid}{\sim} P \quad \text{(for $P$ an unknown distribution on $\mathbb{R}^p$ with independent components)},
\]
\[
x_i = A s_i \quad \text{(for unknown $A$)}.
\]
Note that $s_i$ is analogous to the latent features $z_i$ discussed in prior models, while $A$ is analogous to the loading matrix $U$ discussed in prior models. See the slides for a discussion of the differences between the modeling assumptions of ICA and those of Gaussian models like FA or PPCA.
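
As a concrete illustration of this generative model (not part of the lecture), the short sketch below simulates data from $x = As$ with independent, zero-mean, unit-variance, non-Gaussian sources and checks the implied covariance $\mathrm{Cov}(x) = AA^T$ empirically. The choice of Laplace sources and a random mixing matrix is purely illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 5000, 3

    # Independent, zero-mean, unit-variance, non-Gaussian sources s_i
    # (Laplace(0, 1) has variance 2, so divide by sqrt(2)).
    S = rng.laplace(size=(n, p)) / np.sqrt(2)

    # Unknown mixing matrix A (drawn at random here for illustration only).
    A = rng.normal(size=(p, p))

    # Observed data: row i of X is x_i = A s_i.
    X = S @ A.T

    # The model implies Cov(x) = A A^T; the empirical covariance should be close.
    print(np.abs(np.cov(X, rowvar=False) - A @ A.T).max())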

12.2.3 Preprocessing

Before applying ICA, one typically mean-centers and whitens the data matrix. Whitening is a transformation that decorrelates a set of random variables. Suppose we have data $y$ with known covariance matrix $\mathrm{Cov}[y] = \Sigma$ and mean $E[y] = 0$. Then the whitened form of $y$ is $\tilde{y} = \Sigma^{-1/2} y$, which has covariance $\mathrm{Cov}[\tilde{y}] = I$.

Since we do not have access to the true distribution of $x$, only to the sample $x_1, \ldots, x_n$ from this distribution, we replace all expectations over $x$ in ICA with empirical expectations over the dataset. In this context, this means that we mean-center and pre-whiten each datapoint using the empirical mean, $\bar{x}_n = \frac{1}{n} \sum_{i=1}^n x_i$, and the empirical covariance, $Q = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x}_n)(x_i - \bar{x}_n)^T$. Once we do this scaling, we can restrict our search for the mixing matrix $A$ to the set of orthogonal matrices, since for pre-whitened $x$ we have $I = \mathrm{Cov}[x] = AA^T$. Hence our final ICA goal is to find $W = A^{-1}$ such that, for $s = Wx$, the coordinates of $s$ are minimally dependent.
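
A minimal sketch of this preprocessing step is given below, assuming the data are stored row-wise in a numpy array; the function name and the eigendecomposition route to $Q^{-1/2}$ are illustrative choices rather than anything prescribed by the lecture.

    import numpy as np

    def center_and_whiten(X):
        """Mean-center the rows of X and whiten with the empirical covariance Q."""
        Xc = X - X.mean(axis=0)                  # subtract the empirical mean x_bar_n
        Q = np.cov(Xc, rowvar=False)             # empirical covariance, 1/(n-1) normalization
        eigvals, eigvecs = np.linalg.eigh(Q)     # Q = V diag(eigvals) V^T
        Q_inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
        return Xc @ Q_inv_sqrt                   # row i becomes Q^{-1/2} (x_i - x_bar_n)

    # After this transform the sample covariance is approximately the identity,
    # so the search for the mixing matrix A can be restricted to orthogonal matrices.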

12.2.4 Entropy and Mutual Information

Our notion of minimal dependence will rely on the concepts of entropy and mutual information. Below we suppose that $y \in \mathbb{R}^p$ is a random vector with probability density function $f$.

The entropy $H(y)$ of $y$ is given by
\[
H(y) = -\int f(\upsilon) \log f(\upsilon) \, d\upsilon.
\]
This can be viewed as a measure of the randomness in a distribution or as a measure of the closeness of the distribution to uniformity. An important fact is that the $N(\mu, \Sigma)$ distribution has the maximum entropy amongst all distributions with mean $\mu$ and covariance $\Sigma$.

The mutual information $I(y)$ of $y$ is a measure of departure from independence that can be defined in terms of entropies,
\[
I(y) = \sum_{j=1}^p H(y_j) - H(y),
\]
where $H(y_j)$ is the marginal entropy of component $y_j$ with density $f_j$. $I(y)$ can also be expressed as the Kullback-Leibler divergence between $f(\upsilon)$ and the independent version of the density, $\prod_{j=1}^p f_j(\upsilon_j)$, which is the distribution with independent coordinates closest to the distribution of $y$ in KL divergence.

12.2.5 An Objective for ICA

We are now prepared to define a target objective for ICA on mean-centered, pre-whitened $x$. A useful fact is that whenever $W$ is orthogonal, we have
\[
I(s) = I(Wx) = \sum_{j=1}^p H(w_j^T x) - H(Wx) = \sum_{j=1}^p H(w_j^T x) - H(x) - \log|\det W| = \sum_{j=1}^p H(w_j^T x) - H(x),
\]
where
\[
W = \begin{pmatrix} w_1^T \\ w_2^T \\ \vdots \\ w_p^T \end{pmatrix}
\]
is orthogonal. Hence, our goal of minimizing dependence amongst coordinates can be phrased as
\[
\min_{\text{orthogonal } W} I(Wx).
\]
This problem is equivalent to minimizing dependence amongst the components $w_j^T x = s_j$, which is again equivalent to minimizing the sum of the entropies of the $w_j^T x = s_j$, which finally reduces to maximizing the summed departures of the distributions of the $w_j^T x = s_j$ from Gaussianity. Unfortunately, it is difficult to optimize this objective directly, but many methods are available to either approximately optimize this objective or to optimize approximate objectives.

12.2.6 Negentropy and FastICA

FastICA is one popular approach that optimizes an approximation to our target mutual information objective. More precisely, the FastICA paper defines an equivalent target objective called negentropy. For a single random variable $Y_j$, negentropy is an explicit measure of departure from Gaussianity defined as
\[
J(Y_j) = H(Z_j) - H(Y_j), \quad \text{where } Z_j \sim N(0, \mathrm{Var}(Y_j)).
\]
Note that minimizing $I(Wx)$ over orthogonal $W$ is equivalent to maximizing $\sum_{j=1}^p J(w_j^T x)$ over orthogonal $W$. Moreover, since $\mathrm{Cov}(x) = I$ and $W$ is orthogonal, $\mathrm{Var}(s_j) = \mathrm{Var}(w_j^T x) = 1$ for all $j$, so each $H(Z_j)$ is the same constant (the entropy of a standard Gaussian).

FastICA extracts an initial loadings vector $w_1$ by approximately maximizing an approximation to $J(w_1^T x)$,
\[
J(w_1^T x) \approx \left( E[G(w_1^T x)] - E[G(Z_1)] \right)^2,
\]
where $G(y) = \frac{1}{a} \log \cosh(a y)$ for some $a \in [1, 2]$, and where the population expectation over $x$ is replaced by an empirical expectation over our data:
\[
\text{Empirical objective:} \quad \left( \frac{1}{n} \sum_{i=1}^n G(w_1^T x_i) - E[G(Z_1)] \right)^2.
\]
The details of the Newton-type algorithm proposed to optimize this objective are given in today's ICA reading. This technique yields the first loadings vector $w_1$. Additional loadings vectors $w_j$ can be obtained in succession by maximizing the same approximation applied to $J(w_j^T x)$ under the constraint that $w_j \perp w_l$ for all $l < j$.
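
The reading covers the Newton-type derivation in detail; what follows is only a minimal sketch of the resulting fixed-point iteration with deflation, assuming pre-whitened datapoints stored as rows and $G(y) = \frac{1}{a}\log\cosh(ay)$, so that $g(y) = \tanh(ay)$ and $g'(y) = a(1 - \tanh^2(ay))$. The function name, default arguments, and convergence test are illustrative choices, not taken from the lecture or the reading.

    import numpy as np

    def fastica_deflation(X_white, n_components, a=1.0, max_iter=200, tol=1e-6, seed=0):
        """Sketch of FastICA by deflation on pre-whitened data (rows are datapoints)."""
        rng = np.random.default_rng(seed)
        n, p = X_white.shape
        W = np.zeros((n_components, p))
        for j in range(n_components):
            w = rng.normal(size=p)
            w /= np.linalg.norm(w)
            for _ in range(max_iter):
                y = X_white @ w                              # w^T x_i for each datapoint
                g = np.tanh(a * y)                           # g = G'
                g_prime = a * (1.0 - g ** 2)                 # g' = a (1 - tanh^2)
                w_new = (X_white * g[:, None]).mean(axis=0) - g_prime.mean() * w
                w_new -= W[:j].T @ (W[:j] @ w_new)           # keep w_j orthogonal to w_1, ..., w_{j-1}
                w_new /= np.linalg.norm(w_new)
                converged = abs(abs(w_new @ w) - 1.0) < tol  # convergence up to a sign flip
                w = w_new
                if converged:
                    break
            W[j] = w
        return W                                             # estimated sources: S = X_white @ W.T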

12.3 Canonical Correlation Analysis (CCA)

We will now consider a slightly different setting for linear dimensionality reduction in which we are given two paired, mean-centered datasets, $X$ and $Y$, where
\[
X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix}
\quad \text{and} \quad
Y = \begin{pmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_n^T \end{pmatrix},
\]
for $x_i \in \mathbb{R}^p$, $y_i \in \mathbb{R}^q$. Each pair $(x_i, y_i)$ represents two views of the same entity. For example, in image retrieval $x_i$ might be a pixel-based or visual representation of an image, and $y_i$ might be a keyword or textual description of the same image. Another relevant setting is that of time series analysis, where $x_i$ might be the signal at time $i$ and $y_i$ might be the often highly related signal at time $i + 1$.

Canonical correlation analysis (CCA) is a linear dimensionality reduction technique dating back to Hotelling in 1936 that attempts to reduce the dimensionality of the two data views jointly. This is fruitful when the relationship between the two views reflects useful latent structure underlying each view.

Let us introduce a bit of new notation to make our future calculations more apparent. Hereafter we will let $x$ represent a random vector distributed uniformly over the datapoints $\{x_1, \ldots, x_n\}$, so that it takes on the value $x_i$ with probability $1/n$. The same holds for $y$.

The CCA objective is to find projection directions, called canonical directions, $u$ and $v$ such that the correlation between $u^T x$ and $v^T y$ is maximized. That is, in CCA we focus on how the canonical variables $u^T x$ and $v^T y$ are related and not on how much they vary individually (which is the primary concern of PCA). Note that
\[
\mathrm{Var}(u^T x) = \frac{1}{n} \sum_{i=1}^n u^T x_i x_i^T u = \frac{1}{n} u^T X^T X u,
\]
\[
\mathrm{Cov}(u^T x, v^T y) = \frac{1}{n} \sum_{i=1}^n u^T x_i y_i^T v = \frac{1}{n} u^T X^T Y v,
\]
\[
\mathrm{Corr}(u^T x, v^T y) = \frac{\mathrm{Cov}(u^T x, v^T y)}{\sqrt{\mathrm{Var}(u^T x) \, \mathrm{Var}(v^T y)}} = \frac{u^T X^T Y v}{\sqrt{u^T X^T X u} \, \sqrt{v^T Y^T Y v}}.
\]
This expression makes clear that the canonical variables are invariant to rotations or scalings of either dataset.
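
To make the invariance claim concrete, the small sketch below evaluates the correlation criterion for given directions u and v and checks that rescaling a dataset or a direction leaves it unchanged; the data and function name are made up for illustration.

    import numpy as np

    def cca_correlation(X, Y, u, v):
        """Corr(u^T x, v^T y) for mean-centered X and Y (the 1/n factors cancel)."""
        return (u @ X.T @ Y @ v) / (np.linalg.norm(X @ u) * np.linalg.norm(Y @ v))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5)); X -= X.mean(axis=0)
    Y = rng.normal(size=(100, 3)); Y -= Y.mean(axis=0)
    u, v = rng.normal(size=5), rng.normal(size=3)
    print(cca_correlation(X, Y, u, v))
    print(cca_correlation(3.0 * X, Y, u / 7.0, v))   # rescaling X and u does not change the criterion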

12.3.1 Solving CCA

The CCA optimization problem is the following maximization:
\[
\max_{u, v} \; \frac{u^T X^T Y v}{\sqrt{u^T X^T X u} \, \sqrt{v^T Y^T Y v}} \tag{12.1}
\]
which is equivalent to
\[
\max_{u, v \,:\, \|Xu\|_2 = 1, \; \|Yv\|_2 = 1} \; u^T X^T Y v \tag{12.2}
\]
which in turn is equivalent to
\[
\max_{u, v \,:\, \left\| \begin{pmatrix} X & 0 \\ 0 & Y \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix} \right\|_2^2 = 2} \;
\begin{pmatrix} u^T & v^T \end{pmatrix}
\begin{pmatrix} 0 & X^T Y \\ Y^T X & 0 \end{pmatrix}
\begin{pmatrix} u \\ v \end{pmatrix}, \tag{12.3}
\]
where the second formulation takes the form of a standard generalized singular value problem, and the third takes the form of a generalized eigenvalue problem. The equivalence of (12.2) and (12.3) may not be apparent, as we constrained both $\|Xu\|_2$ and $\|Yv\|_2$ in (12.2) but constrained only the sum of their squares in (12.3). However, we will see that the solution to (12.3) must satisfy $\|Xu\|_2 = \|Yv\|_2$, which implies that the two formulations are in fact equivalent.

Indeed, the solution to the generalized eigenvalue problem (12.3) is given by the generalized eigenvector $\begin{pmatrix} u \\ v \end{pmatrix}$ which satisfies
\[
\begin{pmatrix} 0 & X^T Y \\ Y^T X & 0 \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix}
= \lambda \begin{pmatrix} X^T X & 0 \\ 0 & Y^T Y \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix}
\]
for the largest value of $\lambda$ (the associated generalized eigenvalue). If this $\lambda = 0$, this expression implies that $u^T X^T X u = 0 = v^T Y^T Y v$. Otherwise, $u^T X^T X u = \frac{1}{\lambda} u^T X^T Y v = v^T Y^T Y v$. In either case, we have the advertised equality $\|Xu\|_2 = \|Yv\|_2$.

Note moreover that if $X^T X$ and $Y^T Y$ are invertible, the above problem reduces to a standard eigenvalue problem: finding $\begin{pmatrix} u \\ v \end{pmatrix}$ satisfying
\[
\begin{pmatrix} (X^T X)^{-1} & 0 \\ 0 & (Y^T Y)^{-1} \end{pmatrix}
\begin{pmatrix} 0 & X^T Y \\ Y^T X & 0 \end{pmatrix}
\begin{pmatrix} u \\ v \end{pmatrix}
= \lambda \begin{pmatrix} u \\ v \end{pmatrix}
\]
for the largest $\lambda$.
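
As an illustration of formulation (12.3), the sketch below solves the generalized eigenvalue problem directly with scipy.linalg.eigh and returns a first pair of canonical directions. The function name and the small ridge term (added only so that the right-hand block matrix is positive definite) are illustrative choices, not part of the lecture.

    import numpy as np
    from scipy.linalg import eigh

    def cca_first_pair(X, Y, reg=1e-8):
        """First canonical directions of mean-centered X (n x p) and Y (n x q) via (12.3)."""
        p, q = X.shape[1], Y.shape[1]
        Sxy = X.T @ Y
        A = np.block([[np.zeros((p, p)), Sxy],
                      [Sxy.T, np.zeros((q, q))]])
        B = np.block([[X.T @ X + reg * np.eye(p), np.zeros((p, q))],
                      [np.zeros((q, p)), Y.T @ Y + reg * np.eye(q)]])
        eigvals, eigvecs = eigh(A, B)      # generalized symmetric eigenproblem, ascending eigenvalues
        top = eigvecs[:, -1]               # eigenvector for the largest generalized eigenvalue
        u, v = top[:p], top[p:]
        corr = (u @ Sxy @ v) / (np.linalg.norm(X @ u) * np.linalg.norm(Y @ v))
        return u, v, corr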