INDEPENDENT COMPONENT ANALYSIS VIA NONPARAMETRIC MAXIMUM LIKELIHOOD ESTIMATION

[Title-slide figure: panels "Truth", "Rotated", "Reconstructed" and "Marginal Densities", showing the true sources S, the mixed data X, the reconstructed sources Ŝ and the estimated marginal densities.]

Richard Samworth, University of Cambridge. Joint work with Ming Yuan.

What are ICA models? ICA is a special case of the blind source separation problem, where, from a set of mixed signals, we aim to infer both the source signals and the mixing process; e.g. the cocktail party problem. It was pioneered by Comon (1994), and has become enormously popular in signal processing, machine learning, medical imaging, ...

Mathematical definition. In the simplest, noiseless case, we observe replicates x_1, ..., x_n of X = AS, where the d × d mixing matrix A is invertible and S = (S_1, ..., S_d)^T has independent components. Our main aim is to estimate the unmixing matrix W = A^{-1}; estimation of the marginal distributions P_1, ..., P_d of S is a secondary goal. This semiparametric model is therefore related to PCA.
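
As a quick illustration of this model (my own sketch, not part of the talk), the following numpy snippet simulates the noiseless setting: independent sources whose marginals echo the talk's later examples, a hypothetical mixing matrix A, and observations x_i = A s_i. The estimation target is W = A^{-1}.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 2

# Independent sources (marginals chosen to echo the talk's examples):
# S_1 ~ Exp(1) - 1,  S_2 ~ 0.7 N(-0.9, 1) + 0.3 N(2.1, 1).
s1 = rng.exponential(1.0, n) - 1.0
mix = rng.random(n) < 0.7
s2 = np.where(mix, rng.normal(-0.9, 1.0, n), rng.normal(2.1, 1.0, n))
S = np.column_stack([s1, s2])            # n x d matrix of source replicates

A = np.array([[1.0, 0.6],                # an arbitrary invertible mixing matrix (illustrative)
              [0.4, 1.0]])
X = S @ A.T                              # observed replicates x_i = A s_i

W_true = np.linalg.inv(A)                # the unmixing matrix we aim to estimate
```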

Different previous approaches. Postulate a parametric family for the marginals P_1, ..., P_d, and optimise a contrast function involving (W, P_1, ..., P_d); the contrast usually represents mutual information, maximum entropy or non-Gaussianity (Eriksson et al., 2000; Karvanen et al., 2000). Alternatively, postulate smooth (log-)densities for the marginals (Bach and Jordan, 2002; Hastie and Tibshirani, 2003; Samarov and Tsybakov, 2004; Chen and Bickel, 2006).

Our approach (S. and Yuan, 2012). To avoid assumptions on the existence of densities, and the choice of tuning parameters, we propose to maximise the log-likelihood

  log|det W| + (1/n) Σ_{i=1}^n Σ_{j=1}^d log f_j(w_j^T x_i)

over all non-singular matrices W = (w_1, ..., w_d)^T and univariate log-concave densities f_1, ..., f_d. To understand how this works, we need to understand log-concave ICA projections.
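
A minimal sketch (mine, not the authors' code) of evaluating this objective for a candidate W and candidate log-densities. The placeholder logistic log-density is only there to make the snippet runnable; the proposed estimator would use univariate log-concave MLEs for the f_j instead.

```python
import numpy as np

def ica_loglik(W, log_densities, X):
    """l_n(W, f_1, ..., f_d) = log|det W| + (1/n) sum_i sum_j log f_j(w_j^T x_i)."""
    n, d = X.shape
    proj = X @ W.T                               # proj[i, j] = w_j^T x_i
    total = np.log(abs(np.linalg.det(W)))
    for j in range(d):
        total += np.mean(log_densities[j](proj[:, j]))
    return total

# Placeholder log-density (standard logistic), just to make the sketch runnable.
logistic = lambda u: -u - 2.0 * np.log1p(np.exp(-u))
X_demo = np.random.default_rng(1).normal(size=(200, 2))
value = ica_loglik(np.eye(2), [logistic, logistic], X_demo)
```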

Notation. Let P_k be the set of probability distributions P on R^k with ∫_{R^k} ‖x‖ dP(x) < ∞ and P(H) < 1 for all hyperplanes H. Let F_k be the set of upper semi-continuous log-concave densities on R^k. The condition P ∈ P_d is necessary and sufficient for the existence of a unique log-concave projection ψ* : P_d → F_d given by

  ψ*(P) = argmax_{f ∈ F_d} ∫_{R^d} log f dP

(Cule, S. and Stewart, 2010; Cule and S., 2010; Dümbgen, S. and Schuhmacher, 2011).

ICA notation. Let W_d be the set of invertible d × d matrices. The ICA model P_d^ICA consists of those P ∈ P_d with

  P(B) = Π_{j=1}^d P_j(w_j^T B)  for all Borel B,

for some W ∈ W_d and P_1, ..., P_d ∈ P_1. The log-concave ICA model F_d^ICA consists of those f ∈ F_d with

  f(x) = |det W| Π_{j=1}^d f_j(w_j^T x)

with W ∈ W_d and f_1, ..., f_d ∈ F_1. If X has density f ∈ F_d^ICA, then w_j^T X has density f_j.

Log-concave ICA projections. Let

  ψ**(P) = argmax_{f ∈ F_d^ICA} ∫_{R^d} log f dP.

We also write L**(P) = sup_{f ∈ F_d^ICA} ∫_{R^d} log f dP. The condition P ∈ P_d is necessary and sufficient for L**(P) ∈ R, and then ψ**(P) defines a non-empty, proper subset of F_d^ICA.

An example. Suppose P is the uniform distribution on the unit Euclidean disc in R^2. Then ψ**(P) consists of those f ∈ F_2^ICA that can be represented by an arbitrary W ∈ W_2 and

  f_1(x) = f_2(x) = (2/π)(1 − x^2)^{1/2} 𝟙_{[−1,1]}(x).
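
A quick numerical sanity check (my addition): the stated marginal density, the projection of the uniform distribution on the disc onto a coordinate axis, integrates to one.

```python
import numpy as np
from scipy.integrate import quad

# Integrating the disc's constant density 1/pi over the chord at x gives (2/pi) * sqrt(1 - x^2).
f = lambda x: (2.0 / np.pi) * np.sqrt(1.0 - x ** 2)
total, _ = quad(f, -1.0, 1.0)
print(total)   # approximately 1
```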

Schematic picture of the maps.

[Diagram: P_d ⊃ P_d^ICA on top and F_d ⊃ F_d^ICA below; the log-concave projection ψ* maps P_d into F_d, the log-concave ICA projection ψ** maps P_d into subsets of F_d^ICA, and on P_d^ICA the two projections coincide.]

Log-concave ICA projection on P_d^ICA. If P ∈ P_d^ICA, then ψ**(P) defines a unique element of F_d^ICA, and the map ψ** restricted to P_d^ICA coincides with ψ* restricted to P_d^ICA. Moreover, suppose that P ∈ P_d^ICA, so that

  P(B) = Π_{j=1}^d P_j(w_j^T B)  for all Borel B,

for some W ∈ W_d and P_1, ..., P_d ∈ P_1. Then

  f*(x) := ψ*(P)(x) = |det W| Π_{j=1}^d f_j*(w_j^T x),

where f_j* = ψ*(P_j).

Identifiability (Comon, 1994; Eriksson and Koivunen, 2004). Suppose a probability measure P on R^d satisfies

  P(B) = Π_{j=1}^d P_j(w_j^T B) = Π_{j=1}^d P̃_j(w̃_j^T B)  for all Borel B,

where W, W̃ ∈ W_d and P_1, ..., P_d, P̃_1, ..., P̃_d are probability measures on R. Then there exist a permutation π and a scaling vector ε ∈ (R \ {0})^d such that P̃_j(B_j) = P_{π(j)}(ε_j B_j) and w̃_j = ε_j^{-1} w_{π(j)} if and only if none of P_1, ..., P_d is a Dirac mass and not more than one of them is Gaussian. Consequence: if P ∈ P_d^ICA, then ψ**(P) is identifiable iff P is identifiable.

Convergence. Suppose that P, P^1, P^2, ... ∈ P_d satisfy d(P^n, P) → 0, where d denotes the Wasserstein distance. Then

  sup_{f^n ∈ ψ**(P^n)} inf_{f ∈ ψ**(P)} ∫_{R^d} |f^n − f| → 0.

If P ∈ P_d^ICA is identifiable and (W, P_1, ..., P_d) is an ICA representation of P, then

  sup_{f^n ∈ ψ**(P^n)} sup_{(W^n, f_1^n, ..., f_d^n) representing f^n} inf_{ε_1^n, ..., ε_d^n ∈ R \ {0}} inf_{π^n ∈ Π} { ‖(ε_j^n)^{-1} w^n_{π^n(j)} − w_j‖ + ∫ |ε_j^n f^n_{π^n(j)}(ε_j^n x) − f_j*(x)| dx } → 0,

for each j = 1, ..., d, where f_j* = ψ*(P_j). Consequently, for large n, every f^n ∈ ψ**(P^n) is identifiable.

Estimation procedure. Now suppose (W^0, P_1^0, ..., P_d^0) is an ICA representation of P^0 ∈ P_d^ICA, and we have data x_1, ..., x_n iid from P^0 with n ≥ d + 1. We propose to estimate P^0 by ψ**(P̂_n), where P̂_n is the empirical distribution of the data. That is, we maximise

  ℓ_n(W, f_1, ..., f_d) = log|det W| + (1/n) Σ_{i=1}^n Σ_{j=1}^d log f_j(w_j^T x_i)

over W ∈ W_d and f_1, ..., f_d ∈ F_1.

Consistency. Suppose P^0 is identifiable. For any maximiser (Ŵ^n, f̂_1^n, ..., f̂_d^n) of ℓ_n(W, f_1, ..., f_d), there exist π̂^n ∈ Π and ε̂_1^n, ..., ε̂_d^n ∈ R \ {0} such that

  (ε̂_j^n)^{-1} ŵ^n_{π̂^n(j)} → w_j^0 almost surely  and  ∫ |ε̂_j^n f̂^n_{π̂^n(j)}(ε̂_j^n x) − f_j*(x)| dx → 0 almost surely,

for j = 1, ..., d, where f_j* = ψ*(P_j^0).

Pre-whitening. Pre-whitening is a standard pre-processing step in ICA algorithms to improve stability. We replace the data with z_1 = Σ̂^{-1/2} x_1, ..., z_n = Σ̂^{-1/2} x_n, and maximise the log-likelihood over O ∈ O(d) and g_1, ..., g_d ∈ F_1. If (Ô^n, ĝ_1^n, ..., ĝ_d^n) is a maximiser, we then set Ŵ^n = Ô^n Σ̂^{-1/2} and f̂_j^n = ĝ_j^n. Thus, to estimate the d^2 parameters of W^0, we first estimate the d(d+1)/2 free parameters of Σ, then maximise over the d(d−1)/2 free parameters of O.
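
A minimal sketch of this pre-whitening step, assuming the sample covariance is nonsingular; the centring and the function name are my own choices rather than anything prescribed by the slides.

```python
import numpy as np

def prewhiten(X):
    """Replace x_i by z_i = Sigma_hat^{-1/2} x_i, where Sigma_hat is the sample covariance."""
    Xc = X - X.mean(axis=0)                     # centre the data
    Sigma = np.cov(Xc, rowvar=False)
    evals, evecs = np.linalg.eigh(Sigma)        # symmetric inverse square root
    Sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = Xc @ Sigma_inv_sqrt
    return Z, Sigma_inv_sqrt

# After maximising over O in O(d) with the whitened data, set W_hat = O_hat @ Sigma_inv_sqrt.
```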

Equivalence of the pre-whitened algorithm. Suppose P^0 is identifiable and ∫ ‖x‖^2 dP^0(x) < ∞. With probability 1, for large n, a maximiser (Ŵ^n, f̂_1^n, ..., f̂_d^n) of ℓ_n(W, f_1, ..., f_d) over W ∈ O(d) Σ̂^{-1/2} and f_1, ..., f_d ∈ F_1 exists. For any such maximiser, there exist π̂^n ∈ Π and ε̂_1^n, ..., ε̂_d^n ∈ R \ {0} such that

  (ε̂_j^n)^{-1} ŵ^n_{π̂^n(j)} → w_j^0 almost surely  and  ∫ |ε̂_j^n f̂^n_{π̂^n(j)}(ε̂_j^n x) − f_j*(x)| dx → 0 almost surely,

where f_j* = ψ*(P_j^0).

Computational algorithm. With (pre-whitened) data x_1, ..., x_n, consider maximising ℓ_n(W, f_1, ..., f_d) over W ∈ O(d) and f_1, ..., f_d ∈ F_1.
(1) Initialise W according to Haar measure on O(d).
(2) For j = 1, ..., d, update f_j with the log-concave MLE of w_j^T x_1, ..., w_j^T x_n (Dümbgen and Rufibach, 2011).
(3) Update W using a projected gradient step.
(4) Repeat (2) and (3) until the relative change in the log-likelihood is negligible.

Projected gradient step. The set SO(d) is a d(d−1)/2-dimensional Riemannian submanifold of R^{d^2}. The tangent space at W ∈ SO(d) is

  T_W SO(d) := {WY : Y = −Y^T}.

The unique geodesic passing through W ∈ SO(d) with tangent vector WY (where Y = −Y^T) is the map α : [0, 1] → SO(d) given by α(t) = W exp(tY), where exp is the usual matrix exponential.
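
A small sketch (mine) of such a geodesic step using scipy's matrix exponential; the function name and step size are illustrative only.

```python
import numpy as np
from scipy.linalg import expm

def geodesic_step(W, Y, t):
    """Move along the geodesic alpha(t) = W exp(tY) on SO(d); Y must be skew-symmetric."""
    assert np.allclose(Y, -Y.T), "tangent directions have the form WY with Y = -Y^T"
    return W @ expm(t * Y)

d = 3
W = np.eye(d)
Y = np.array([[0.0, 1.0, 0.0],
              [-1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])          # a skew-symmetric direction
W_new = geodesic_step(W, Y, 0.1)
print(np.allclose(W_new @ W_new.T, np.eye(d)))   # the step stays on the orthogonal group
```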

Projected gradient step 2. On [min(w_j^T x_1, ..., w_j^T x_n), max(w_j^T x_1, ..., w_j^T x_n)], we have

  log f̂_j(x) = min_{k=1,...,m_j} (b_jk x − β_jk).

For 1 ≤ s < r ≤ d, let Y_{r,s} denote the matrix with Y_{r,s}(r, s) = 1/√2, Y_{r,s}(s, r) = −1/√2 and zeros otherwise. Then Y^+ = {Y_{r,s} : 1 ≤ s < r ≤ d} forms an orthonormal basis for the skew-symmetric matrices. Let Y^− = {−Y : Y ∈ Y^+}. Choose Y_max ∈ Y^+ ∪ Y^− to maximise the one-sided directional derivative ∇_{WY} g(W), where

  g(W) = (1/n) Σ_{i=1}^n Σ_{j=1}^d min_{k=1,...,m_j} (b_jk w_j^T x_i − β_jk).
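
A sketch, under the sign convention written on this slide (log f_j(x) = min_k (b_jk x − β_jk)), of building the orthonormal basis Y^+ and evaluating g(W) when each log-density is stored as a list of affine pieces; the variable names and the toy call at the end are mine.

```python
import numpy as np

def skew_basis(d):
    """Orthonormal basis {Y_{r,s}: 1 <= s < r <= d} for the d x d skew-symmetric matrices."""
    basis = []
    for r in range(1, d):
        for s in range(r):
            Y = np.zeros((d, d))
            Y[r, s] = 1.0 / np.sqrt(2.0)
            Y[s, r] = -1.0 / np.sqrt(2.0)
            basis.append(Y)
    return basis

def g(W, X, slopes, intercepts):
    """g(W) = (1/n) sum_i sum_j min_k (b_jk w_j^T x_i - beta_jk),
    where slopes[j], intercepts[j] hold the affine pieces (b_jk, beta_jk) of log f_j."""
    proj = X @ W.T                                              # proj[i, j] = w_j^T x_i
    total = 0.0
    for j in range(X.shape[1]):
        pieces = np.outer(proj[:, j], slopes[j]) - intercepts[j]   # n x m_j array of affine values
        total += pieces.min(axis=1).mean()
    return total

# Toy call: 2-d data with a single affine piece per coordinate.
X_demo = np.random.default_rng(2).normal(size=(100, 2))
print(g(np.eye(2), X_demo, [np.array([0.0]), np.array([0.0])], [np.array([1.0]), np.array([1.0])]))
```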

Exp(1) − 1 example.

[Figure: panels "Truth", "Rotated", "Reconstructed" and "Marginal Densities" for sources with Exp(1) − 1 marginals.]

0.7 N(−0.9, 1) + 0.3 N(2.1, 1) example.

[Figure: panels "Truth", "Rotated", "Reconstructed" and "Marginal Densities" for sources with 0.7 N(−0.9, 1) + 0.3 N(2.1, 1) marginals.]

Performance comparison.

[Figure: boxplots of the Amari metric for LogConICA, FastICA and ProDenICA, with Uniform, Exponential, t_2, mixture-of-normals and Binomial source distributions.]
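
For reference, here is a common form of the Amari discrepancy used in such comparisons (my sketch; normalisation conventions vary across papers, so this is indicative of the metric plotted rather than a reproduction of it).

```python
import numpy as np

def amari_error(W_hat, A):
    """Amari discrepancy between an estimated unmixing matrix W_hat and the true mixing
    matrix A; it is 0 iff W_hat @ A is a scaled permutation matrix."""
    P = np.abs(W_hat @ A)
    d = P.shape[0]
    rows = (P.sum(axis=1) / P.max(axis=1) - 1.0).sum()
    cols = (P.sum(axis=0) / P.max(axis=0) - 1.0).sum()
    return (rows + cols) / (2.0 * d)
```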

References

Bach, F. and Jordan, M. I. (2002) Kernel independent component analysis. Journal of Machine Learning Research, 3, 1–48.

Chen, A. and Bickel, P. J. (2006) Efficient independent component analysis. The Annals of Statistics, 34, 2825–2855.

Comon, P. (1994) Independent component analysis, a new concept? Signal Processing, 36, 287–314.

Cule, M. and Samworth, R. (2010) Theoretical properties of the log-concave maximum likelihood estimator of a multidimensional density. Electron. J. Stat., 4, 254–270.

Cule, M., Samworth, R. and Stewart, M. (2010) Maximum likelihood estimation of a multi-dimensional log-concave density. J. Roy. Statist. Soc., Ser. B (with discussion), 72, 545–607.

Dümbgen, L. and Rufibach, K. (2011) logcondens: Computations related to univariate log-concave density estimation. J. Statist. Software, 39, 1–28.

Dümbgen, L., Samworth, R. and Schuhmacher, D. (2011) Approximation by log-concave distributions, with applications to regression. Ann. Statist., 39, 702–730.

Eriksson, J. and Koivunen, V. (2004) Identifiability, separability and uniqueness of linear ICA models. IEEE Signal Processing Letters, 11, 601–604.

Hastie, T. and Tibshirani, R. (2003) Independent component analysis through product density estimation. In Advances in Neural Information Processing Systems 15 (Becker, S. and Obermayer, K., eds), MIT Press, Cambridge, MA, pp. 649–656.

Hastie, T. and Tibshirani, R. (2003) ProDenICA: Product Density Estimation for ICA using tilted Gaussian density estimates. R package version 1.0. http://cran.r-project.org/web/packages/prodenica/.

Samarov, A. and Tsybakov, A. (2004) Nonparametric independent component analysis. Bernoulli, 10, 565–582.

Samworth, R. J. and Yuan, M. (2012) Independent component analysis via nonparametric maximum likelihood estimation. http://arxiv.org/abs/1206.0457.