Independent Component Analysis and FastICA

Copyright Changwei Xiong 2016. June 2016, last update: July 7, 2016

TABLE OF CONTENTS

1. Introduction
2. Independence by Non-gaussianity
2.1. Differential Entropy
2.2. Maximum Entropy Probability Distribution
2.3. Measure of Non-gaussianity: Negentropy
2.4. Approximation of Negentropy by Cumulants
2.5. Approximation of Negentropy by Nonpolynomial Functions
3. FastICA using Negentropy
3.1. Preprocessing for ICA
3.2. Maximization of Negentropy
3.3. Symmetric Orthogonalization
References

This note introduces independent component analysis and the FastICA method developed by Hyvärinen et al. [1].

1. INTRODUCTION

Independent component analysis (ICA) is a powerful technique for separating an observed multivariate signal into statistically independent non-Gaussian components. For example, suppose several independent (non-Gaussian) sound sources are installed at different locations in a room. Assuming no time delays or echoes, a microphone placed somewhere in the room picks up an audio signal that is a linear mixture of the sound sources, with weights determined by the proximity of each source to the microphone. By sampling the audio signals from several differently located microphones, the independent sound sources can be retrieved by applying ICA to the sampled signals.

ICA shares a certain similarity with the well-known principal component analysis (PCA). But unlike PCA, which imposes a strong Gaussianity assumption, seeks variance maximization, and ensures only uncorrelatedness via the first and second moments of the signal, ICA exploits inherently non-Gaussian features to achieve independence and employs information from higher moments. For ICA to function effectively, the independent components of the source signal must be non-Gaussian. The non-gaussianity is crucial because a whitened multivariate Gaussian signal, whose components are uncorrelated and of unit variance, has a completely symmetric multivariate density. This symmetry provides no information about the directions of the columns of the mixing matrix and hence makes estimation of the matrix impossible. Although the ICA model can still be applied to Gaussian signals, it can then only be estimated up to an orthogonal transformation, and the mixing matrix cannot be identified.
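As a concrete illustration of the mixing model just described, here is a minimal NumPy sketch that builds two independent non-Gaussian sources and mixes them with an arbitrary invertible matrix; the particular sources and mixing matrix are illustrative assumptions, not taken from the note.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent, non-Gaussian sources (illustrative choices):
# a sine wave and uniformly distributed noise, 10,000 samples each.
t = np.linspace(0.0, 100.0, 10_000)
s = np.vstack([np.sin(2 * np.pi * 0.5 * t),            # source 1
               rng.uniform(-1.0, 1.0, size=t.size)])   # source 2

# An arbitrary invertible mixing matrix M; each row of v is one "microphone".
M = np.array([[0.60, 0.40],
              [0.45, 0.55]])
v = M @ s   # observed mixtures, shape (2, 10000)

# ICA aims to recover s (up to permutation and scaling) from v alone.
print(v.shape)
```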

2. INDEPENDENCE BY NON-GAUSSIANITY

As mentioned, the key to estimating the ICA model is non-gaussianity. From the central limit theorem, we know that the distribution of a sum of independent random variables tends toward a Gaussian distribution. In other words, a sum of two independent random variables usually has a distribution closer to Gaussian than either of the two original random variables. Hence identifying the independent source signal s from the observed mixture v is equivalent to finding a transformation matrix B such that the non-gaussianity of Bv is maximized.

2.1. Differential Entropy

To find a good measure of the objective, i.e. non-gaussianity, we must introduce a quantity known as entropy for a probability distribution. In information theory, entropy is a measure of randomness. For a discrete random variable X, it is defined by

H[X] = -\sum_i P(X = \xi_i) \log P(X = \xi_i)    (1)

and for a continuous random variable, in which case it is often called differential entropy, it is defined by

H[X] = -\int p_X(\xi) \log p_X(\xi) \, d\xi = -E[\log p_X(X)]    (2)

Here we want to show one property of the differential entropy that is of special interest to us when we come to the ICA method. Suppose Y = f(X) is an invertible transformation from a random vector X to a random vector Y, and we want to find the connection between the entropies H[Y] and H[X]. First, the density of Y can be written as

p_Y(y) = p_X(f^{-1}(y)) \, |\det J(f^{-1}(y))|^{-1}    (3)

where J(\xi) is the Jacobian matrix of the function f(x) evaluated at \xi. Hence the entropy H[Y] becomes

H[Y] = -E[\log p_Y(Y)] = -E\big[\log p_X(f^{-1}(Y)) - \log|\det J(f^{-1}(Y))|\big] = H[X] + E[\log|\det J(X)|]    (4)

This shows that the transformation changes the entropy by the amount E[\log|\det J(X)|]. In the special case where the transformation is linear, e.g. Y = MX, we obtain

H[Y] = H[X] + \log|\det M|    (5)
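Property (5) can be checked numerically with a Gaussian random vector, whose differential entropy has the closed form (1/2) log((2 pi e)^n det Sigma). The sketch below, with an arbitrary covariance Sigma and linear map M chosen purely for illustration, confirms that the entropy shifts by log|det M|.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of a zero-mean multivariate Gaussian:
    H = 0.5 * log((2*pi*e)^n * det(cov))."""
    n = cov.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(cov))

rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])            # covariance of X (arbitrary choice)
M = rng.normal(size=(2, 2))               # an arbitrary invertible linear map

H_X = gaussian_entropy(Sigma)
H_Y = gaussian_entropy(M @ Sigma @ M.T)   # covariance of Y = M X is M Sigma M^T

# Property (5): H[Y] = H[X] + log|det M|
print(np.isclose(H_Y, H_X + np.log(abs(np.linalg.det(M)))))   # True
```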

2.2. Maximum Entropy Probability Distribution

Assume that X is a continuous random variable with density p_X(\xi) and that the information available on the density p_X(\xi) is of the form (e.g. if F_i(\xi) = \xi, the constraint gives the mean of X)

\int p_X(\xi) F_i(\xi) \, d\xi = E[F_i(X)] = c_i,  i = 1, \ldots, n    (6)

We want to find the density p_X(\xi) that maximizes the entropy subject to the constraints in (6). To do so, we may consider the following functional

J[p_X(\xi)] = -\int p_X(\xi) \log p_X(\xi) \, d\xi + a_0 \Big( \int p_X(\xi) \, d\xi - 1 \Big) + \sum_{i=1}^{n} a_i \Big( \int p_X(\xi) F_i(\xi) \, d\xi - c_i \Big)    (7)

where the a_i, i = 0, \ldots, n, are Lagrange multipliers. The zeroth constraint ensures that the total probability sums to 1. The entropy attains a maximum when the density function satisfies the Euler-Lagrange equation, i.e. the functional derivative must equal zero

\frac{\delta J[f(\xi)]}{\delta f(\xi)} = \frac{\partial L}{\partial f} - \frac{d}{d\xi} \frac{\partial L}{\partial f'} = 0  for  J[f(\xi)] = \int L\big(\xi, f(\xi), f'(\xi)\big) \, d\xi    (8)

where f' stands for the first derivative of the function f. Since f' does not appear explicitly in the integrand of J[p_X(\xi)], the second term of the Euler-Lagrange equation vanishes for every function f, and thus

\frac{\delta J[p_X(\xi)]}{\delta p_X(\xi)} = -\log p_X(\xi) - 1 + a_0 + \sum_{i=1}^{n} a_i F_i(\xi) = 0    (9)

Hence the density p_X(\xi) with maximum entropy under the constraints in (6) must be of the form

p_X(\xi) = \exp\Big( -1 + a_0 + \sum_{i=1}^{n} a_i F_i(\xi) \Big) = A \exp\Big( \sum_{i=1}^{n} a_i F_i(\xi) \Big)    (10)

In the case that the constraints are limited to the mean E[X] = \mu and the variance E[(X - \mu)^2] = \sigma^2, the (univariate) maximum entropy density in (10) reduces to

p_X(\xi) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big( -\frac{(\xi - \mu)^2}{2\sigma^2} \Big)    (11)

which is the well-known Gaussian density.
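To illustrate this result, the following sketch compares the closed-form differential entropies of three zero-mean, unit-variance densities; the Gaussian comes out largest. The particular comparison distributions (uniform and Laplace) are illustrative choices, not part of the note.

```python
import numpy as np

# Differential entropies (closed form) of three zero-mean, unit-variance densities.
H_gauss   = 0.5 * np.log(2 * np.pi * np.e)     # N(0, 1)
H_laplace = 1.0 + np.log(2.0 / np.sqrt(2.0))   # Laplace with scale b = 1/sqrt(2)
H_uniform = np.log(2 * np.sqrt(3.0))           # Uniform(-sqrt(3), sqrt(3))

print(f"Gaussian: {H_gauss:.4f}")    # ~1.4189, the largest
print(f"Laplace : {H_laplace:.4f}")  # ~1.3466
print(f"Uniform : {H_uniform:.4f}")  # ~1.2425
```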

2.3. Measure of Non-gaussianity: Negentropy

A fundamental result of the previous discussion is that a Gaussian variable has the largest entropy among all random variables of equal mean and variance. This means that entropy can be used as a measure of non-gaussianity. To obtain a measure that is zero for a Gaussian and positive for all other random variables, one often uses a slightly modified version of the differential entropy, called negentropy \Theta, which is defined as

\Theta[X] = H[Z] - H[X]    (12)

where X is an arbitrary (multivariate) random variable and Z is a (multivariate) Gaussian with the same covariance matrix as X. It is evident that the negentropy is always non-negative and becomes zero if and only if X is Gaussian. An interesting property arising from (5) is that the negentropy (12) is invariant under an invertible linear transformation Y = MX, that is

\Theta[Y] = \Theta[MX] = H[Z] + \log|\det M| - H[X] - \log|\det M| = \Theta[X]    (13)

Negentropy therefore appears to be a good measure of non-gaussianity. However, its use turns out to be computationally challenging, because in practice it is quite difficult to estimate the density function accurately and efficiently, and the density is the key ingredient in computing the integral in (2). Therefore some approximations [2], though possibly rather coarse, have to be used. In the following we introduce two such approximation methods.

2.4. Approximation of Negentropy by Cumulants

The first approximation is based on cumulants of random variables. The cumulants [3] of a random variable X, denoted \kappa_n[X], are given by the cumulant generating function C_X(t), which is defined as the logarithm of the characteristic function \varphi_X(t) of X, that is C_X(t) = \log \varphi_X(t). The characteristic function \varphi_X(t) is the Fourier transform of the density function and can be used to generate the raw moments \nu_n[X]

\varphi_X(t) = E[\exp(itX)] = 1 + \sum_{n=1}^{\infty} \nu_n \frac{(it)^n}{n!}  and  \nu_n[X] = E[X^n] = i^{-n} \varphi_X^{(n)}(0)    (14)

where \varphi_X^{(n)}(0) denotes the n-th derivative of the characteristic function evaluated at 0. Since the cumulant generating function C_X(t) is a composition of two functions, each of which has a power series expansion,

it also has a power series expansion (a Maclaurin series, i.e. a Taylor series centered at zero). The cumulants \kappa_n[X] can be derived from C_X(t) by

C_X(t) = \log \varphi_X(t) = \sum_{n=1}^{\infty} \kappa_n[X] \frac{(it)^n}{n!}  and  \kappa_n[X] = i^{-n} C_X^{(n)}(0)    (15)

For example, written in terms of the raw moments, the first four cumulants are

\kappa_1 = \nu_1,  \kappa_2 = \nu_2 - \nu_1^2,  \kappa_3 = \nu_3 - 3\nu_2\nu_1 + 2\nu_1^3,
\kappa_4 = \nu_4 - 3\nu_2^2 - 4\nu_3\nu_1 + 12\nu_2\nu_1^2 - 6\nu_1^4    (16)

The approximation rests on the assumption that the random variable X is close to a standard Gaussian. Let \varphi(\xi) be the density of a (univariate) standard Gaussian

\varphi(\xi) = \frac{1}{\sqrt{2\pi}} \exp\Big( -\frac{\xi^2}{2} \Big)    (17)

The derivatives of the density \varphi(\xi) define a series of functions h_n(\xi), known as the (probabilists') Hermite polynomials

\varphi^{(n)}(\xi) = (-1)^n h_n(\xi) \varphi(\xi)    (18)

For example, the first few Hermite polynomials are as follows

h_0(\xi) = 1,  h_1(\xi) = \xi,  h_2(\xi) = \xi^2 - 1,  h_3(\xi) = \xi^3 - 3\xi,
h_4(\xi) = \xi^4 - 6\xi^2 + 3,  h_5(\xi) = \xi^5 - 10\xi^3 + 15\xi, \ldots    (19)

The Hermite polynomials form an orthogonal system, such that

E[h_m(Z) h_n(Z)] = \int \varphi(\xi) h_m(\xi) h_n(\xi) \, d\xi = \begin{cases} n! & \text{if } m = n \\ 0 & \text{if } m \ne n \end{cases}    (20)

where Z is a standard Gaussian. Because the density p_X(\xi) of the random variable X is close to \varphi(\xi), it can be approximated by a Gram-Charlier expansion [4] truncated to include the first two non-constant terms. The approximation of p_X(\xi) is defined in terms of the cumulants \kappa_n[X] of X as follows

p_X(\xi) \approx \hat{p}_X(\xi) = \varphi(\xi) \Big( 1 + \frac{\kappa_3[X] h_3(\xi)}{3!} + \frac{\kappa_4[X] h_4(\xi)}{4!} \Big)    (21)
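As a quick numerical check of (19)-(20), the sketch below evaluates E[h_m(Z) h_n(Z)] with Gauss-Hermite quadrature using NumPy's probabilists' Hermite module (hermite_e); it is purely an illustration of the orthogonality relation and not part of the FastICA algorithm itself.

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial

# Gauss-HermiteE quadrature nodes/weights for the weight exp(-x^2/2);
# dividing the weights by sqrt(2*pi) turns the quadrature sum into E[.] under N(0,1).
nodes, weights = He.hermegauss(20)
weights = weights / np.sqrt(2 * np.pi)

def h(n, x):
    """Probabilists' Hermite polynomial h_n(x), as listed in (19)."""
    c = np.zeros(n + 1)
    c[n] = 1.0
    return He.hermeval(x, c)

# Check the orthogonality relation (20): E[h_m(Z) h_n(Z)] = n! if m == n, else 0.
for m in range(6):
    for n in range(6):
        val = np.sum(weights * h(m, nodes) * h(n, nodes))
        exact = factorial(n) if m == n else 0.0
        assert np.isclose(val, exact, atol=1e-8), (m, n, val)
print("orthogonality relation (20) holds")
```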

The next step is to estimate the entropy using \hat{p}_X(\xi) from (21). Since p_X(\xi) is close to \varphi(\xi), the cumulants \kappa_3[X] and \kappa_4[X] are small, and we can apply the approximation \log(1 + \varepsilon) \approx \varepsilon - \varepsilon^2/2 + o(\varepsilon^3) to get

H[X] \approx -\int \hat{p}_X(\xi) \log \hat{p}_X(\xi) \, d\xi \approx -\int \varphi(\xi) \big(1 + \varepsilon(\xi)\big) \Big[ \log \varphi(\xi) + \varepsilon(\xi) - \frac{\varepsilon(\xi)^2}{2} \Big] d\xi,  with  \varepsilon(\xi) = \frac{\kappa_3[X] h_3(\xi)}{3!} + \frac{\kappa_4[X] h_4(\xi)}{4!}
= H[Z] - \int \varphi(\xi)\,\varepsilon(\xi) \log \varphi(\xi) \, d\xi - \int \varphi(\xi)\,\varepsilon(\xi) \, d\xi - \frac{1}{2} \int \varphi(\xi)\,\varepsilon(\xi)^2 \, d\xi + \cdots
= H[Z] - 0 - 0 - \frac{1}{2} \Big( \frac{\kappa_3[X]^2}{3!} + \frac{\kappa_4[X]^2}{4!} \Big) = H[Z] - \frac{\kappa_3[X]^2}{12} - \frac{\kappa_4[X]^2}{48}    (22)

Here we have used the orthogonality in (20), and in particular the fact that h_3(\xi) and h_4(\xi) are orthogonal to any second-order polynomial, which includes \log \varphi(\xi). Furthermore, because both \kappa_3[X] and \kappa_4[X] are small, their third-order monomials are much smaller than the terms involving only second-order monomials and have been neglected. Given the result in (22), we may approximate the negentropy of a random variable X (with zero mean and unit variance) that is close to a Gaussian by

\Theta[X] = H[Z] - H[X] \approx \frac{\kappa_3[X]^2}{12} + \frac{\kappa_4[X]^2}{48} = \frac{E[X^3]^2}{12} + \frac{\mathrm{kurt}[X]^2}{48},  where  \mathrm{kurt}[X] = E[X^4] - 3E[X^2]^2    (23)

This approximation of negentropy is obviously very efficient to compute. However, it has the drawback of being very sensitive to outliers: it mainly measures the tails of the distribution and is largely unaffected by structure near its center.
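A minimal sketch of the cumulant-based approximation (23) follows, applied to a few sample distributions chosen purely for illustration: the value is near zero for Gaussian data and positive for skewed or non-Gaussian-tailed data.

```python
import numpy as np

def negentropy_cumulant(x):
    """Cumulant-based negentropy approximation (23) for a standardized sample:
    E[X^3]^2 / 12 + kurt(X)^2 / 48, with kurt(X) = E[X^4] - 3 E[X^2]^2."""
    x = (x - x.mean()) / x.std()          # enforce zero mean, unit variance
    skew = np.mean(x ** 3)
    kurt = np.mean(x ** 4) - 3.0 * np.mean(x ** 2) ** 2
    return skew ** 2 / 12.0 + kurt ** 2 / 48.0

rng = np.random.default_rng(3)
n = 200_000
print(negentropy_cumulant(rng.standard_normal(n)))   # ~0: Gaussian
print(negentropy_cumulant(rng.uniform(-1, 1, n)))    # > 0: sub-Gaussian (kurt ~ -1.2)
print(negentropy_cumulant(rng.laplace(size=n)))      # > 0: super-Gaussian (kurt ~ +3)
```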

To mitigate this issue, a more robust approximation of negentropy has been developed, as follows.

2.5. Approximation of Negentropy by Nonpolynomial Functions

Let us assume again that we have observed (or estimated) a number of expectations of different functions of X, as in (6). The functions F_i(\xi) are now in general not polynomials. Similarly, we make a simple approximation \hat{p}_X(\xi) of the maximum entropy density, based on the assumption that the true density p_X(\xi) is not far from the Gaussian density with the same mean and variance. To simplify matters, let us assume that X has zero mean and unit variance, so that we can add two constraints to (6), defined by

F_{n+1}(\xi) = \xi,  c_{n+1} = E[X] = 0  and  F_{n+2}(\xi) = \xi^2,  c_{n+2} = E[X^2] = 1    (24)

which turns (10) into

\hat{p}_X(\xi) = A \exp\Big( \sum_{i=1}^{n+2} a_i F_i(\xi) \Big)    (25)

To further simplify the calculations, let us make another, purely technical assumption: the functions F_i(\xi), i = 1, \ldots, n, form an orthonormal system with respect to the measure \varphi(\xi) and are orthogonal to all polynomials up to second degree (similar to the orthogonality of the Hermite polynomials). In other words, we have

E[F_i(Z) F_j(Z)] = \int \varphi(\xi) F_i(\xi) F_j(\xi) \, d\xi = \delta(i,j),  i, j = 1, \ldots, n,  with  \delta(i,j) = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \ne j \end{cases}
E[F_i(Z) Z^k] = \int \varphi(\xi) F_i(\xi) \xi^k \, d\xi = 0,  k = 0, 1, 2    (26)

where Z is a standard Gaussian. Due to the assumed near-gaussianity, the exponential in (25) is not far from \exp(-\xi^2/2); thus all the other a_i in (25) are small compared to a_{n+2} \approx -1/2. Hence we can make a first-order approximation of the exponential function using the expansion \exp(\varepsilon) \approx 1 + \varepsilon + o(\varepsilon^2)

\hat{p}_X(\xi) = A \exp\Big( \sum_{i=1}^{n+2} a_i F_i(\xi) \Big) = A \exp\Big( -\frac{\xi^2}{2} + a_{n+1}\xi + \big(a_{n+2} + \tfrac{1}{2}\big)\xi^2 + \sum_{i=1}^{n} a_i F_i(\xi) \Big)
\approx \tilde{A}\varphi(\xi) \Big( 1 + a_{n+1}\xi + \big(a_{n+2} + \tfrac{1}{2}\big)\xi^2 + \sum_{i=1}^{n} a_i F_i(\xi) \Big),  where  \tilde{A}\varphi(\xi) = A \exp\Big(-\frac{\xi^2}{2}\Big)    (27)

By the orthogonality in (26) and the following expectations of powers of a standard Gaussian

E[Z] = 0,  E[Z^2] = 1,  E[Z^3] = 0,  E[Z^4] = 3    (28)

we can derive a series of equations as follows

\int \hat{p}_X(\xi) \, d\xi = \tilde{A} \, E\Big[ 1 + a_{n+1}Z + \big(a_{n+2} + \tfrac{1}{2}\big)Z^2 + \sum_{i=1}^{n} a_i F_i(Z) \Big] = \tilde{A}\big( \tfrac{3}{2} + a_{n+2} \big) = 1
\int \hat{p}_X(\xi) \, \xi \, d\xi = \tilde{A} \, E\Big[ Z + a_{n+1}Z^2 + \big(a_{n+2} + \tfrac{1}{2}\big)Z^3 + \sum_{i=1}^{n} a_i F_i(Z) Z \Big] = \tilde{A} a_{n+1} = 0
\int \hat{p}_X(\xi) \, \xi^2 \, d\xi = \tilde{A} \, E\Big[ Z^2 + a_{n+1}Z^3 + \big(a_{n+2} + \tfrac{1}{2}\big)Z^4 + \sum_{i=1}^{n} a_i F_i(Z) Z^2 \Big] = \tilde{A}\big( \tfrac{5}{2} + 3a_{n+2} \big) = 1
\int \hat{p}_X(\xi) F_i(\xi) \, d\xi = \tilde{A} \, E\Big[ F_i(Z) \Big( 1 + a_{n+1}Z + \big(a_{n+2} + \tfrac{1}{2}\big)Z^2 + \sum_{j=1}^{n} a_j F_j(Z) \Big) \Big] = \tilde{A} a_i = c_i    (29)

These equations can be solved to give

\tilde{A} = 1,  a_i = c_i,  i = 1, \ldots, n,  a_{n+1} = 0,  a_{n+2} = -\tfrac{1}{2}    (30)

The approximate maximum entropy density, denoted \tilde{p}_X(\xi), can then be obtained from (27) using the solved constants, that is

\tilde{p}_X(\xi) = \varphi(\xi) \Big( 1 + \sum_{i=1}^{n} c_i F_i(\xi) \Big)  with  c_i = E[F_i(X)]    (31)

Given the density above, we can approximate the differential entropy as

H[X] \approx -\int \tilde{p}_X(\xi) \log \tilde{p}_X(\xi) \, d\xi
= -\int \varphi(\xi) \Big( 1 + \sum_{i=1}^{n} c_i F_i(\xi) \Big) \Big[ \log \varphi(\xi) + \log\Big( 1 + \sum_{i=1}^{n} c_i F_i(\xi) \Big) \Big] d\xi

\approx H[Z] - \sum_{i=1}^{n} c_i \, E[F_i(Z) \log \varphi(Z)] - \sum_{i=1}^{n} c_i \, E[F_i(Z)] - \frac{1}{2} \sum_{i,j=1}^{n} c_i c_j \, E[F_i(Z) F_j(Z)] = H[Z] - \frac{1}{2} \sum_{i=1}^{n} c_i^2    (32)

where the first two sums vanish by the orthogonality in (26), and another approximate expansion, (1 + \varepsilon)\log(1 + \varepsilon) \approx \varepsilon + \varepsilon^2/2 + o(\varepsilon^3), has been applied, the quantity \varepsilon = \sum_i c_i F_i(\xi) being much smaller than 1. The negentropy approximation is then given by

\Theta[X] \approx \frac{1}{2} \sum_{i=1}^{n} c_i^2 = \frac{1}{2} \sum_{i=1}^{n} E[F_i(X)]^2    (33)

A special and yet simple case of (31) can be obtained by using just two functions: an odd function G_1(\xi) and an even function G_2(\xi). The odd function measures asymmetry, and the even function measures bimodality vs. peak at zero, which is closely related to sub- vs. super-gaussianity [5]. First we define F_1(\xi) and F_2(\xi) as

F_1(\xi) = \frac{G_1(\xi) + \alpha\xi}{\sqrt{\lambda_1}},  F_2(\xi) = \frac{G_2(\xi) + \beta_1 + \beta_2 \xi^2}{\sqrt{\lambda_2}}    (34)

Since G_1(\xi) and G_2(\xi) are odd and even functions respectively, so are F_1(\xi) and F_2(\xi), and the orthogonality E[F_1(Z) F_2(Z)] = 0 is automatically satisfied. To determine the constants \alpha, \beta_1, \beta_2, \lambda_1, \lambda_2, the functions F_1(\xi) and F_2(\xi) must satisfy the orthonormality conditions in (26) with respect to the measure \varphi(\xi), which gives the following equations

E[F_1(Z)^2] = \frac{E[G_1(Z)^2] + 2\alpha E[G_1(Z)Z] + \alpha^2 E[Z^2]}{\lambda_1} = \frac{E[G_1(Z)^2] + 2\alpha E[G_1(Z)Z] + \alpha^2}{\lambda_1} = 1
E[F_1(Z)Z] = \frac{E[G_1(Z)Z] + \alpha E[Z^2]}{\sqrt{\lambda_1}} = \frac{E[G_1(Z)Z] + \alpha}{\sqrt{\lambda_1}} = 0
E[F_2(Z)^2] = \frac{E[G_2(Z)^2] + \beta_1^2 + \beta_2^2 E[Z^4] + 2\beta_1 E[G_2(Z)] + 2\beta_1\beta_2 E[Z^2] + 2\beta_2 E[G_2(Z)Z^2]}{\lambda_2} = \frac{E[G_2(Z)^2] + \beta_1^2 + 3\beta_2^2 + 2\beta_1 E[G_2(Z)] + 2\beta_1\beta_2 + 2\beta_2 E[G_2(Z)Z^2]}{\lambda_2} = 1

E[F_2(Z)] = \frac{E[G_2(Z)] + \beta_1 + \beta_2 E[Z^2]}{\sqrt{\lambda_2}} = \frac{E[G_2(Z)] + \beta_1 + \beta_2}{\sqrt{\lambda_2}} = 0
E[F_2(Z)Z^2] = \frac{E[G_2(Z)Z^2] + \beta_1 E[Z^2] + \beta_2 E[Z^4]}{\sqrt{\lambda_2}} = \frac{E[G_2(Z)Z^2] + \beta_1 + 3\beta_2}{\sqrt{\lambda_2}} = 0    (35)

The constants can be determined from the equations above via

E[G_1(Z)Z] = -\alpha,  E[G_1(Z)^2] = \lambda_1 + \alpha^2,  E[G_2(Z)] = -\beta_1 - \beta_2,
E[G_2(Z)Z^2] = -\beta_1 - 3\beta_2,  E[G_2(Z)^2] = \lambda_2 + \beta_1^2 + 2\beta_1\beta_2 + 3\beta_2^2    (36)

Given the assumption (24) that X has zero mean and unit variance, we can further write

c_1 = E[F_1(X)] = \frac{E[G_1(X)] + \alpha E[X]}{\sqrt{\lambda_1}} = \frac{E[G_1(X)]}{\sqrt{\lambda_1}}
c_2 = E[F_2(X)] = \frac{E[G_2(X)] + \beta_1 + \beta_2 E[X^2]}{\sqrt{\lambda_2}} = \frac{E[G_2(X)] + \beta_1 + \beta_2}{\sqrt{\lambda_2}} = \frac{E[G_2(X)] - E[G_2(Z)]}{\sqrt{\lambda_2}}    (37)

where the denominators follow from (36) as

\lambda_1 = E[G_1(Z)^2] - E[G_1(Z)Z]^2
\lambda_2 = E[G_2(Z)^2] - E[G_2(Z)]^2 - \tfrac{1}{2}\big( E[G_2(Z)] - E[G_2(Z)Z^2] \big)^2    (38)

Hence the negentropy approximation in (33) can be expressed as

\Theta[X] \approx \frac{c_1^2 + c_2^2}{2} = k_1 E[G_1(X)]^2 + k_2 \big( E[G_2(X)] - E[G_2(Z)] \big)^2,  with  k_1 = \frac{1}{2\lambda_1},  k_2 = \frac{1}{2\lambda_2}    (39)

Note that in real applications the assumption that the true density of X is not too far from Gaussian can easily be violated. However, even when the approximation is not very accurate, (39) can still be used to construct a non-gaussianity measure that is consistent in the sense that it is always non-negative and equals zero if X is Gaussian. To make the approximation more robust, the functions G_i(\xi) must be chosen wisely. The selection criteria should consider the following: 1) the estimation of E[G_i(X)] should be statistically easy and insensitive to outliers; 2) the G_i(\xi) must not grow faster than quadratically, to ensure that the density in (31) is integrable and therefore that the maximum entropy distribution exists in the first place; 3) the G_i(\xi) must

capture aspects of the distribution of X that are pertinent to the computation of entropy. For example, if we use simple polynomials (which are apparently bad choices), we end up with something very similar to what we had in the preceding section. This can be seen by letting G_1(\xi) = \xi^3 and G_2(\xi) = \xi^4; then from (38) we find

\lambda_1 = E[Z^6] - E[Z^4]^2 = 15 - 9 = 6
\lambda_2 = E[Z^8] - E[Z^4]^2 - \tfrac{1}{2}\big( E[Z^4] - E[Z^6] \big)^2 = 105 - 9 - \tfrac{1}{2}(3 - 15)^2 = 24    (40)

and hence the negentropy approximation in (39) becomes

\Theta[X] \approx \frac{E[X^3]^2}{12} + \frac{(E[X^4] - 3)^2}{48}    (41)

which is identical to (23). When only one non-quadratic even function is used, an even simpler approximation of negentropy can be obtained. This amounts to omitting the term associated with the odd function in (39), giving

\Theta[X] \propto \big( E[G(X)] - E[G(Z)] \big)^2    (42)

It has been suggested [6] that the following two functions exhibit very good properties

G(\xi) = \frac{1}{a} \log\cosh(a\xi)  with  1 \le a \le 2,  and  G(\xi) = -\exp\Big( -\frac{\xi^2}{2} \Big)    (43)

The first is useful for general purposes, while the second one may be highly robust.
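Below is a small sketch of the one-function measure (42) using the two contrast functions of (43). Here E[G(Z)] is estimated from a large Gaussian sample rather than computed in closed form, and the test distributions are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

# The two contrast functions suggested in (43); 'a' in [1, 2] is a user choice.
def G_logcosh(x, a=1.0):
    return np.log(np.cosh(a * x)) / a

def G_gauss(x):
    return -np.exp(-x ** 2 / 2.0)

def negentropy_measure(x, G, n_gauss=1_000_000):
    """Non-gaussianity measure of (42), up to a constant: (E[G(X)] - E[G(Z)])^2
    for a standardized sample x; E[G(Z)] is estimated from a large Gaussian sample."""
    x = (x - x.mean()) / x.std()
    EG_Z = G(rng.standard_normal(n_gauss)).mean()
    return (G(x).mean() - EG_Z) ** 2

n = 200_000
for name, sample in [("gaussian", rng.standard_normal(n)),
                     ("uniform ", rng.uniform(-1, 1, n)),
                     ("laplace ", rng.laplace(size=n))]:
    print(name, negentropy_measure(sample, G_logcosh), negentropy_measure(sample, G_gauss))
```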

3. FASTICA USING NEGENTROPY

In this section we summarize a numerical method, known as FastICA [7], for ICA estimation. Suppose we observe a signal V such that

V = MS    (44)

i.e. V is a linear mixture, by an invertible mixing matrix M, of the independent source signal S. Both V and S are n x 1 vectors. We want to recover the independent components by maximizing the statistical independence of the estimated components.

3.1. Preprocessing for ICA

Preprocessing the observed signal V for ICA generally involves two steps: centering and whitening. Centering means subtracting the mean from the signal. Whitening means linearly transforming the observed signal so that the transformed signal is white, i.e. its components are uncorrelated and their variances equal unity. While centering is simple, there are many ways to perform whitening. The method we discuss here is PCA whitening, which performs an eigenvalue decomposition of the covariance matrix \Sigma of the observed signal V such that

\Sigma[V] = E \Lambda E^T    (45)

Whitening can then be done on the centered signal by

X = \Lambda^{-1/2} E^T (V - E[V])    (46)

where X is the whitened signal that will be subject to the FastICA estimation. After centering and whitening, the signal X has the following properties

E[X] = 0  and  E[XX^T] = I    (47)

where I is an identity matrix.
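A minimal NumPy sketch of the centering and PCA whitening of (45)-(47) follows; the signal layout (rows = mixtures, columns = samples) and the variable names are my own conventions for illustration.

```python
import numpy as np

def whiten(v):
    """Center and PCA-whiten an observed signal per (45)-(47).
    v has shape (n, T): n mixtures, T samples. Returns the whitened signal x
    with E[x] = 0 and E[x x^T] = I, plus the mean and the whitening matrix."""
    mean = v.mean(axis=1, keepdims=True)
    vc = v - mean                                   # centering
    cov = vc @ vc.T / vc.shape[1]                   # sample covariance Sigma[V]
    eigval, E = np.linalg.eigh(cov)                 # Sigma = E Lambda E^T
    W = np.diag(eigval ** -0.5) @ E.T               # Lambda^{-1/2} E^T
    x = W @ vc                                      # equation (46)
    return x, mean, W

# Quick check on some synthetic mixtures: the whitened covariance is the identity.
rng = np.random.default_rng(5)
v = np.array([[0.6, 0.4], [0.45, 0.55]]) @ rng.laplace(size=(2, 100_000))
x, mean, W = whiten(v)
print(np.allclose(x @ x.T / x.shape[1], np.eye(2)))  # True, property (47)
```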

3.2. Maximization of Negentropy

Suppose we want to find an invertible matrix B such that the linear transformation Y = BX produces a signal Y whose components are mutually independent with unit variance. Independence of the components requires that they be uncorrelated, and in the whitened space we must have

I = E[YY^T] = B \, E[XX^T] \, B^T = BB^T    (48)

This means that after whitening the matrix B can be taken to be orthonormal (i.e. orthogonal with unit-norm rows). Hence we want to maximize the negentropy of each component subject to the constraint that the components are mutually uncorrelated. Let b be the transpose of a row vector of the matrix B. We want to find the b that maximizes the negentropy of the component b^T X on the unit sphere, i.e. subject to b^T b = 1. The constrained maximization can be done using a Lagrange multiplier, that is

L(b, \lambda) = \big( E[G(b^T X)] - E[G(Z)] \big)^2 + \lambda (b^T b - 1)    (49)

The first term on the right-hand side comes from the approximation of negentropy in (42) and the second is the optimization constraint. The optimality condition requires

\nabla_b L(b, \lambda) = 2\gamma \, E[G'(b^T X) X] + 2\lambda b = 0,  where  \gamma = E[G(b^T X)] - E[G(Z)]
\;\Longleftrightarrow\;  E[G'(b^T X) X] + \frac{\lambda}{\gamma} b = 0    (50)

Since \gamma is a scalar, the vector E[G'(b^T X) X] at the maximum must equal the vector b multiplied by a scalar constant. In other words, the vector E[G'(b^T X) X] must point in the same direction as b. Because the vector b is normalized in every iteration, this leads to the fixed-point scheme

b \leftarrow \mathrm{Normalize}\big( E[G'(b^T X) X] \big)    (51)

However, this fixed-point iteration is often accompanied by a slow convergence rate. Instead, the solution is sought by solving, for an arbitrary scalar \beta, the nonlinear equation

E[G'(b^T X) X] + \beta b = 0    (52)

using Newton's method. Denoting the left-hand side of (52) by

F(b) = E[G'(b^T X) X] + \beta b    (53)

Newton's method defines another fixed-point iteration, such that

b \leftarrow \mathrm{Normalize}\big( b - F'(b)^{-1} F(b) \big) = \mathrm{Normalize}\Big( b - \frac{E[G'(b^T X) X] + \beta b}{E[G''(b^T X)] + \beta} \Big)    (54)

where the gradient matrix F'(b) can be approximated by

F'(b) = E[G''(b^T X) XX^T] + \beta I \approx E[G''(b^T X)] \, E[XX^T] + \beta I = \big( E[G''(b^T X)] + \beta \big) I    (55)

given that the components of the whitened X are uncorrelated (though not necessarily independent). Since the quantity E[G''(b^T X)] + \beta is a scalar and would eventually be eliminated by the normalization anyway, we can multiply both sides of (54) by this scalar and simplify the fixed-point iteration to

b \leftarrow \mathrm{Normalize}\big( E[G'(b^T X) X] - E[G''(b^T X)] \, b \big)    (56)

Once the iteration converges (e.g. \|\Delta b\| < \epsilon for a small number \epsilon), b^T X gives one of the independent components that we want to estimate. As a summary, we list the non-quadratic candidate functions in (43) together with their first and second derivatives as follows

G(\xi) = \frac{1}{a}\log\cosh(a\xi),  G'(\xi) = \tanh(a\xi),  G''(\xi) = a\big(1 - \tanh^2(a\xi)\big),  with  1 \le a \le 2
G(\xi) = -\exp\Big(-\frac{\xi^2}{2}\Big),  G'(\xi) = \xi \exp\Big(-\frac{\xi^2}{2}\Big),  G''(\xi) = (1 - \xi^2)\exp\Big(-\frac{\xi^2}{2}\Big)    (57)

3.3. Symmetric Orthogonalization

In certain applications it may be desirable to use a symmetric decorrelation, in which no vector is privileged over the others. This means that the vectors b are not estimated one by one; instead, they are estimated in parallel by a fixed-point iteration derived from (56),

B \leftarrow \mathrm{SymmetricOrthonormalize}\Big( E[G'(BX) X^T] - \mathrm{Diag}\big( E[G''(BX)] \big) B \Big)    (58)

where the symmetric orthonormalization of a matrix B is given by

\mathrm{SymmetricOrthonormalize}(B) = (BB^T)^{-1/2} B    (59)

with the inverse square root of the matrix BB^T computed from the eigenvalue decomposition

(BB^T)^{-1/2} = E \Lambda^{-1/2} E^T  for  BB^T = E \Lambda E^T    (60)

Alternatively, an iteration-based symmetric orthonormalization can be performed by first normalizing B \leftarrow B / \|B\| and then running the iteration

B \leftarrow \tfrac{3}{2} B - \tfrac{1}{2} BB^T B    (61)

until the matrix BB^T is sufficiently close to an identity.
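To tie the pieces together, here is a compact sketch of the parallel FastICA iteration with the symmetric orthogonalization of (58)-(60), using the log-cosh contrast from (57). It assumes an already whitened input of shape (n, T); the convergence test, variable names, and toy demo sources are my own illustrative choices, and library implementations (e.g. scikit-learn's FastICA) add more careful handling.

```python
import numpy as np

def sym_orthonormalize(B):
    """Symmetric orthonormalization (59)-(60): B <- (B B^T)^{-1/2} B."""
    eigval, E = np.linalg.eigh(B @ B.T)
    return E @ np.diag(eigval ** -0.5) @ E.T @ B

def fastica(x, n_iter=200, tol=1e-8, a=1.0, seed=0):
    """Parallel FastICA on a whitened signal x of shape (n, T), using the
    log-cosh contrast of (57): G'(u) = tanh(a u), G''(u) = a (1 - tanh(a u)^2),
    and the symmetric fixed-point update (58)."""
    n, T = x.shape
    rng = np.random.default_rng(seed)
    B = sym_orthonormalize(rng.standard_normal((n, n)))   # random orthonormal start
    for _ in range(n_iter):
        u = B @ x                                          # current estimates, shape (n, T)
        g = np.tanh(a * u)                                 # G'(u)
        g_prime = a * (1.0 - g ** 2)                       # G''(u)
        B_new = sym_orthonormalize(g @ x.T / T - np.diag(g_prime.mean(axis=1)) @ B)
        converged = np.max(np.abs(np.abs(np.diag(B_new @ B.T)) - 1.0)) < tol
        B = B_new
        if converged:
            break
    return B @ x, B   # estimated independent components and the unmixing matrix

# End-to-end demo: mix two non-Gaussian sources, whiten as in 3.1, then unmix.
rng = np.random.default_rng(6)
s = np.vstack([np.sign(rng.standard_normal(50_000)),       # binary (sub-Gaussian) source
               rng.uniform(-1.0, 1.0, 50_000)])            # uniform source
v = np.array([[0.7, 0.3], [0.2, 0.8]]) @ s
vc = v - v.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(vc @ vc.T / vc.shape[1])
x = np.diag(d ** -0.5) @ E.T @ vc                           # whitened mixtures
y, B = fastica(x)
# Each recovered component should match one source up to sign and scale.
print(np.round(np.corrcoef(np.vstack([y, s]))[:2, 2:], 2))
```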

REFERENCES

1. Hyvärinen, A.; Oja, E., Independent Component Analysis: Algorithms and Applications, Neural Networks, 13(4-5), 411-430, 2000
2. Hyvärinen, A.; Karhunen, J.; Oja, E., Independent Component Analysis, Wiley, 2001, pp. 113-120
3. Wikipedia: https://en.wikipedia.org/wiki/cumulant
4. Wikipedia: https://en.wikipedia.org/wiki/edgeworth_series
5. Hyvärinen, A.; Karhunen, J.; Oja, E., Independent Component Analysis, Wiley, 2001, pp. 118
6. Hyvärinen, A.; Karhunen, J.; Oja, E., Independent Component Analysis, Wiley, 2001, pp. 184
7. Hyvärinen, A.; Karhunen, J.; Oja, E., Independent Component Analysis, Wiley, 2001, pp. 188-196