An Introduction to Independent Components Analysis (ICA)

An Introduction to Independent Components Analysis (ICA)
Anish R. Shah, CFA
Northfield Information Services
Anish@northinfo.com
Newport, Jun 6, 2008

Overview of Talk
- Review principal components
- Introduce independent components

Part I: Principal Components Analysis (PCA)
At each step, PCA finds the direction that explains the most remaining variation.

1st principal component of a cylinder

PCA on the Florida Keys

Principal components of 2 years of annual mutual fund returns
Returns of top (by 10-yr return) US diversified mutual funds; 1 outlier excluded
[scatter plot of fund returns, 2007 (horizontal) vs. 2006 (vertical)]

Principal Component Analysis (PCA)
Setup: centered observations x_1 .. x_n
Each observation is a vector of length t, e.g.
- x_1 = IBM's monthly returns for the past 5 years (t = 60)
- x_1 = location of 1 acre of the Florida Keys (t = 2)
- x_1 = pixel intensities of a digital photo (t = 10^6)
Centering happens by subtracting off
a) the mean of each observation (e.g. for a stock covariance model), or
b) the centroid of the set of observations (e.g. Florida Keys)
Goal: find the linear model that minimizes the squared error between the model and the data.
Think of regressing many stocks against a single independent variable. PCA finds the independent variable that explains the most on average.
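As a concrete illustration of this setup, here is a minimal numpy sketch (my own, not part of the talk; the toy data and variable names are assumptions). The best-fitting single component can be read directly off the SVD of the centered observation matrix.

```python
import numpy as np

# Minimal sketch of the PCA setup above. Following the talk's later convention,
# X is t x n: each column is one observation of length t (toy data here).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 100))              # e.g. 60 months x 100 stocks
X = X - X.mean(axis=0, keepdims=True)       # option (a): subtract each observation's mean

# The SVD gives the least-squares components directly: columns of U live in R^t.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
y = U[:, 0]                                  # the single "independent variable" explaining the most
betas = X.T @ y                              # each observation's regression slope on y
residuals = X - np.outer(y, betas)           # repeat on these to get the next component
```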

PCA (cont.)
Consider the familiar linear model x_i(t) = β_i y(t) + error
Regression: SE_i = squared error for observation i = Σ_{t=1..m} [x_i(t) - β_i y(t)]^2
Given y and x, regression sets β to minimize squared error: β_i* = x_i^T y / (y^T y)
PCA: TSE = total squared error over all observations = Σ_{i=1..n} SE_i
PCA finds the vector y (and the β_i's) that minimize TSE.
To get the next component, repeat on the residuals {x_i - β_i y}.
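A small numerical sketch of this minimization (my own, with made-up data): alternating the two regressions, solve for the β's given y, then for y given the β's, is power iteration and converges to the first principal component.

```python
import numpy as np

# Sketch: find the y minimizing total squared error by alternating the two
# regressions above; this is power iteration on X X^T. Toy data, t = 60, n = 50.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 50))
X = X - X.mean(axis=0)

y = rng.normal(size=X.shape[0])
for _ in range(200):
    y = y / np.linalg.norm(y)
    beta = X.T @ y                           # beta_i* = x_i^T y / (y^T y), with ||y|| = 1
    y = X @ beta                             # holding the betas fixed, the best y is X beta (up to scale)

y = y / np.linalg.norm(y)
beta = X.T @ y
tse = np.sum((X - np.outer(y, beta)) ** 2)   # total squared error left after one component
```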

PCs come from eigenvectors of the matrix of 2nd moments
C = X^T X = (n × n) matrix of 2nd moments across t
  C_{i,j} = Σ_{s=1..t} x_i(s) x_j(s), e.g. the co-movement between securities i & j across time
C̃ = X X^T = (t × t) matrix of 2nd moments across n
  C̃_{i,j} = Σ_{k=1..n} x_k(i) x_k(j), e.g. the co-movement between periods i & j across securities
C and C̃ have the same non-zero eigenvalues and yield the same PCs,
i.e. the covariance of 5000 stocks over 60 months can be analyzed as a 60 × 60 matrix instead of a 5000 × 5000 one.

PCs come from eigenvectors of the matrix of 2nd moments (cont.)
X = (t × n) matrix of observations
C = X^T X (size n × n)        C̃ = X X^T (size t × t)
λ_i = the i-th largest eigenvalue of C or C̃
v_i = (n × 1) normalized eigenvector of C corresponding to λ_i
ṽ_i = (t × 1) normalized eigenvector of C̃ corresponding to λ_i
C = Σ_i λ_i v_i v_i^T         C̃ = Σ_i λ_i ṽ_i ṽ_i^T
v_i = X^T ṽ_i / λ_i^(1/2)     ṽ_i = X v_i / λ_i^(1/2)
p_i = (t × 1) i-th normalized principal component = ṽ_i = X v_i / λ_i^(1/2)
e_i = (n × 1) exposures to the component = X^T ṽ_i = λ_i^(1/2) v_i
Average squared error explained by the component = λ_i / (t·n)
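A quick numerical check of the small-matrix trick (my own sketch; the sizes are arbitrary): the leading eigenvector of the t × t matrix X X^T matches the first principal component from the full SVD.

```python
import numpy as np

# The t x t eigenproblem recovers the same principal components as the
# (much larger) n x n one, so 500 series over 60 periods needs only a
# 60 x 60 eigen-decomposition.
rng = np.random.default_rng(2)
t, n = 60, 500
X = rng.normal(size=(t, n))
X = X - X.mean(axis=0)

C_tilde = X @ X.T                            # t x t matrix of 2nd moments
lam, V_tilde = np.linalg.eigh(C_tilde)       # eigh returns ascending order
lam, V_tilde = lam[::-1], V_tilde[:, ::-1]

p1 = V_tilde[:, 0]                           # first principal component (length t)
e1 = X.T @ p1                                # exposures = sqrt(lam_1) * v_1 (length n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.abs(p1 @ U[:, 0]))                  # ~1.0: same direction up to sign
```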

PCA (cont.)
More volatile observations have greater impact in determining components.
To counteract, reweight: x_i(t) → w_i x_i(t)
- e.g. w_i = 1 / ||x_i|| maximizes average correlation
- w_i = sqrt(mkt cap_i) weights squared error by cap
Exposures for the original observations are the reweighted observation's exposures divided by the weight.
2 views of PCA:
- a low-dimensional representation of something high-dimensional, e.g. stock return covariance
- a way to separate features from noise, e.g. extracting the structural part of stock returns
PCA yields factors uncorrelated with each other.

Uncorrelated doesn't mean independent
f(x) = x and g(x) = x^2: f and g are uncorrelated (for x symmetric about 0, E[f g] = E[x^3] = 0 = E[f] E[g]), but g = f^2
[plot of the parabola g = f^2 over a symmetric range of f]
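A quick numerical check of this example (my own sketch): the correlation between x and x^2 is essentially zero, yet a simple nonlinear test exposes the dependence.

```python
import numpy as np

# f = x and g = x^2 are uncorrelated when x is symmetric about 0, yet g is a
# deterministic function of f, so they are certainly not independent.
rng = np.random.default_rng(3)
x = rng.uniform(-5, 5, size=100_000)
f, g = x, x ** 2

print(np.corrcoef(f, g)[0, 1])                        # ~0: uncorrelated
# Independence would require E[f^2 g] = E[f^2] E[g]; here the two sides differ:
print(np.mean(f ** 2 * g), np.mean(f ** 2) * np.mean(g))
```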

An application: face recognition
Training:
- Start with a large # of pictures of faces
- Calculate the PCs of this set, called eigenfaces
- For each person, take a reference photo and calculate its loadings on the PCs
Model in operation:
- Person looks into the camera. The image's eigenface loadings are compared to the reference photo's.
Image source: AT&T Laboratories Cambridge

Part II: Independent Components
Goal: extract the signals driving a process
- Cocktail party problem: separate the sound of several talkers into individual voices
- Stock market returns: extract signals that investors use to price securities, fit a predictive model for profit
Cichocki, Stansell, Leonowicz, & Buck. "Independent variable selection: Application of independent component analysis to forecasting a stock index", Journal of Asset Management, v6(4), Dec 2005
Back & Weigend. "A first application of independent component analysis to extracting structure from stock returns", International Journal of Neural Systems, v8(4), Aug 1997

An example: observe 4 linear combinations of 4 signals

Principal components of the example

Independent components of the example

ICA: Independent Component Analysis
Similar setup
- Assume there exist independent signals S = [s_1(t), ..., s_n(t)]
- Observe only linear combinations of them, X(t) = A S(t)
- Both A and S are unknown! A is called the mixing matrix
Goal
- Recover the original signals S(t) from X(t), i.e. find a linear transformation L, ideally A^(-1), such that L X(t) = S(t) (up to scale and permutation)
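A short sketch of this setup in code (my own illustration; the talk does not prescribe an algorithm or library): mix three synthetic sources with a random matrix and recover them with scikit-learn's FastICA.

```python
import numpy as np
from sklearn.decomposition import FastICA

# The mixing model X = A S and its recovery, up to scale and permutation.
rng = np.random.default_rng(4)
t = np.linspace(0, 8, 2000)
S = np.column_stack([np.sin(2 * t),                 # source 1: sine
                     np.sign(np.cos(3 * t)),        # source 2: square wave
                     rng.laplace(size=t.size)])     # source 3: heavy-tailed noise
A = rng.normal(size=(3, 3))                         # unknown mixing matrix
X = S @ A.T                                         # observed mixtures, one row per time point

ica = FastICA(n_components=3, random_state=0)
S_hat = ica.fit_transform(X)                        # recovered sources (scale/permutation ambiguous)
```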

ICA: Basic idea
First get rid of correlation: whitening
- Apply a linear transformation N to decorrelate and normalize the signals: (NX)^T (NX) = I. Let Z = NX
- The whitening transformation isn't unique: any rotation of whitened signals is white.
  W a rotation ⇒ W^T W = I ⇒ (WZ)^T (WZ) = Z^T (W^T W) Z = Z^T Z = I
- Principal components are one source of whitened signals
Then address higher-order dependence
- Find a rotation W that makes the whitened signals independent, i.e. the columns of WZ independent
- The optimization problem is: minimize over W dep(WZ), s.t. W^T W = I (W is a rotation),
  where dep(M) is a measure of the dependency between the columns of M
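A minimal whitening sketch (my own illustration): build one valid N from the eigen-decomposition of the covariance and confirm that any further orthogonal transformation leaves the data white.

```python
import numpy as np

# Whitening: Z = X N^T has identity covariance; an orthogonal W keeps Z white.
rng = np.random.default_rng(5)
S = rng.laplace(size=(5000, 3))
X = S @ rng.normal(size=(3, 3)).T            # correlated mixtures, rows = samples
X = X - X.mean(axis=0)

lam, E = np.linalg.eigh(X.T @ X / len(X))
N = E @ np.diag(lam ** -0.5) @ E.T           # one valid whitening transformation
Z = X @ N.T

print(np.round(Z.T @ Z / len(Z), 3))         # ~ identity: decorrelated, unit variance

Q, _ = np.linalg.qr(rng.normal(size=(3, 3))) # a random orthogonal W
print(np.round((Z @ Q.T).T @ (Z @ Q.T) / len(Z), 3))   # still ~ identity
```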

Notions of independence: 1. Nonlinear decorrelation
- Signals are already decorrelated by whitening: E[u v] = E[u] E[v]
- Know that for independent signals u & v, E[g(u) h(v)] = E[g(u)] E[h(v)] for all g, h
- Are there functions ĝ & ĥ whose nonlinearity captures most of the higher-order dependence?
  Answer: how well a particular function works depends on the shape of the data distribution's tails
- dep(M) = magnitude of the difference between E[ĝ(u) ĥ(v)] and E[ĝ(u)] E[ĥ(v)] when ĝ & ĥ are applied to the columns of M
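A small demonstration of nonlinear decorrelation (my own sketch; g(u) = u^2 is just one simple choice of nonlinearity): a rotation of independent, unit-variance Laplacian sources is already white, yet the nonlinear test is clearly nonzero.

```python
import numpy as np

# White (uncorrelated, unit-variance) but not independent: a rotation of
# independent Laplacian sources. E[g(u) h(v)] - E[g(u)] E[h(v)] exposes it.
rng = np.random.default_rng(6)
s = rng.laplace(scale=1 / np.sqrt(2), size=(200_000, 2))   # unit variance, independent
theta = np.pi / 6
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
z = s @ Q.T                                                # whitened but dependent

def nl_gap(u, v, g=np.square, h=np.square):
    return np.mean(g(u) * h(v)) - np.mean(g(u)) * np.mean(h(v))

print(np.corrcoef(z[:, 0], z[:, 1])[0, 1])   # ~0: ordinary correlation sees nothing
print(nl_gap(z[:, 0], z[:, 1]))              # clearly nonzero: higher-order dependence
print(nl_gap(s[:, 0], s[:, 1]))              # ~0 for the truly independent sources
```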

Notions of independence: 2. Non-Gaussianity
- Model is X = AS, where the columns of S are independent
- The central limit theorem says adding things together makes them more Gaussian
- Unmixed signals should be less Gaussian

A measure of non-Gaussianity: Kurtosis
- Kurtosis = 4th centered moment / squared variance
- A measure of the mass in the distribution's tails
- Highly influenced by outliers
- Takes values from 1 to ∞; a Gaussian is 3
- Excess kurtosis = kurtosis - 3; takes values from -2 to ∞; a Gaussian is 0
- Maximize the absolute value to find non-Gaussian signals
- dep(M) = -1 × |excess kurtosis of the columns of M|
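A short sketch of excess kurtosis as a non-Gaussianity score (my own illustration with synthetic samples): heavy-tailed and light-tailed samples score away from 0, and adding them together moves the score toward 0, as the CLT argument suggests.

```python
import numpy as np

# Excess kurtosis: 0 for a Gaussian, positive for heavy tails, negative for
# light tails; summing independent signals pulls it toward 0.
rng = np.random.default_rng(7)

def excess_kurtosis(x):
    x = x - x.mean()
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2 - 3

gauss   = rng.normal(size=200_000)
laplace = rng.laplace(size=200_000)
uniform = rng.uniform(-1, 1, size=200_000)

print(excess_kurtosis(gauss))                     # ~0
print(excess_kurtosis(laplace))                   # ~+3 (super-Gaussian)
print(excess_kurtosis(uniform))                   # ~-1.2 (sub-Gaussian)
print(excess_kurtosis(laplace + 2 * uniform))     # mixing pushes the score toward 0
```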

Background: Information theory (Shannon 1948)
- Entropy H(X) = -Σ p(x) log p(x) = # of bits needed to encode X
- Differential entropy h(X) = -∫ p(x) log p(x) dx
  For a given variance, Gaussians maximize differential entropy
- Kullback-Leibler divergence (relative entropy) D(p‖q) = ∫ p(x) log [p(x)/q(x)] dx ≥ 0
- Mutual information I(X,Y) = D[ p(x,y) ‖ p(x) p(y) ]
  = avg # of bits X tells about Y = avg # of bits Y tells about X
  I(X,Y) = 0 ⟺ X & Y independent
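These quantities are easy to compute for a small discrete example (my own sketch; the joint distribution below is made up): entropy, KL divergence, and the mutual information of a dependent pair.

```python
import numpy as np

# Entropy, KL divergence, and mutual information (in bits) for a small
# discrete joint distribution of (X, Y).
def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])                 # a dependent joint distribution
px = pxy.sum(axis=1)
py = pxy.sum(axis=0)

# I(X,Y) = D( p(x,y) || p(x) p(y) ); zero iff X and Y are independent.
mi = kl(pxy.ravel(), np.outer(px, py).ravel())
print(entropy(px), entropy(py), mi)          # H(X) = H(Y) = 1 bit, I(X,Y) > 0 here
```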

A measure of non-Gaussianity: Negentropy
- Negentropy = shortfall in entropy relative to a Gaussian with the same variance
- Useful because it is scale invariant
- To evaluate it, need the probability distribution
- Estimate densities by expansions around a Gaussian density:
  - Cumulant (moment) based (Edgeworth, Gram-Charlier): sensitive to outliers
  - By non-polynomial functions, e.g. x exp(-x^2/2), tanh(x): more robust, but the choice of functions depends on the tails
- Maximize an approximation of negentropy
- dep(M) = -1 × negentropy of the columns of M
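One common non-polynomial approximation (a sketch under the assumption that G(y) = log cosh(y) is the contrast, as in the FastICA literature; the talk does not fix a particular choice) scores negentropy as J(y) ≈ [E G(y) - E G(ν)]^2, with ν a standard Gaussian.

```python
import numpy as np

# Non-polynomial negentropy approximation: ~0 for Gaussian data, larger for
# anything non-Gaussian; standardizing first keeps it scale invariant.
rng = np.random.default_rng(8)

def negentropy_approx(y, n_ref=1_000_000):
    y = (y - y.mean()) / y.std()
    nu = rng.normal(size=n_ref)                 # reference standard Gaussian sample
    G = lambda u: np.log(np.cosh(u))
    return (np.mean(G(y)) - np.mean(G(nu))) ** 2

print(negentropy_approx(rng.normal(size=100_000)))    # ~0 for a Gaussian
print(negentropy_approx(rng.laplace(size=100_000)))   # larger for heavy tails
print(negentropy_approx(rng.uniform(size=100_000)))   # larger for light tails too
```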

Mutual information
- Minimize the mutual information among the signals
- dep(M) = mutual information of the columns of M
- After manipulating, and constraining the signals to be uncorrelated, minimizing mutual information is the same as maximizing negentropy

Another characterization: Maximum likelihood
- Recall the model X = AS; let B = A^(-1)
- p(x) = |det B| ∏_i p_i(b_i^T x)
- Find the demixing matrix B that maximizes the likelihood of the observations X
- First need an (inexact) model of the p_i's, similar to the density approximation for negentropy
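A sketch of this characterization (my own illustration, not the talk's prescribed algorithm): assume a super-Gaussian source model p_i(s) ∝ 1/cosh(s), whose score function is tanh(s), whiten first, and climb the likelihood with the natural gradient.

```python
import numpy as np

# Maximum-likelihood ICA by natural-gradient ascent under a 1/cosh source model.
rng = np.random.default_rng(9)
n, T = 3, 20_000
S = rng.laplace(size=(n, T))                 # true independent sources
A = rng.normal(size=(n, n))                  # unknown mixing matrix
X = A @ S                                    # observations, one column per sample

# Whiten first, as in the "basic idea" slide.
X = X - X.mean(axis=1, keepdims=True)
lam, E = np.linalg.eigh(X @ X.T / T)
N = E @ np.diag(lam ** -0.5) @ E.T
Z = N @ X

B = np.eye(n)                                # demixing estimate for the whitened data
lr = 0.1
for _ in range(500):
    Y = B @ Z                                # current source estimates
    grad = (np.eye(n) - np.tanh(Y) @ Y.T / T) @ B   # natural gradient of the log-likelihood
    B = B + lr * grad

P = B @ N @ A     # close to a scaled permutation matrix when unmixing succeeds
print(np.round(P, 2))
```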

Connections to Human Wiring
- ICA can be characterized as sparse coding: how can signals be represented compactly (each signal loading on a few of the factors) while retaining as much information as possible?
- A neuron codes only a few messages and rarely fires
- Edges are the independent components in pictures of nature
- Our visual system is built to detect edges

Summary
- An incredibly clever and powerful tool for extracting information
- Fundamental: it can be motivated from many different starting points

References
- Hyvärinen, A., J. Karhunen, and E. Oja. Independent Component Analysis. 2001.
- Stone, James. Independent Component Analysis: A Tutorial Introduction. 2004.
- Bishop, Christopher. Pattern Recognition and Machine Learning. 2007.
- Shawe-Taylor, J. and N. Cristianini. Kernel Methods for Pattern Analysis. 2004.