
Independent Component Analysis and Projection Pursuit: A Tutorial Introduction

James V Stone and John Porrill, Psychology Department, Sheffield University, Sheffield, S10 2UR, England.
Email: {j.porrill, j.v.stone}@shef.ac.uk
April 2, 1998

Abstract

Independent component analysis (ICA) and projection pursuit (PP) are two related techniques for separating mixtures of source signals into their individual components. These rapidly evolving techniques are currently finding applications in speech separation, ERP, EEG, fMRI, and low-level vision. Their power resides in the simple and realistic assumption that different physical processes tend to generate statistically independent signals. We provide an account that is intended as an informal introduction, as well as a mathematical and geometric description of the methods.

1 Introduction

Independent component analysis (ICA) [Jutten and Herault, 1988] and projection pursuit (PP) [Friedman, 1987] are methods for recovering underlying source signals from linear mixtures of these signals. This rather terse description does not capture the deep connection between ICA/PP and the fundamental nature of the physical world. In the following pages, we hope to establish not only that ICA/PP are powerful and useful tools, but also that this power follows naturally from the fact that ICA/PP are based on assumptions which are remarkably attuned to the spatiotemporal structure of the physical world.

Most measured quantities are actually mixtures of other quantities. Typical examples are: i) sound signals in a room with several people talking simultaneously, ii) an EEG signal, which contains contributions from many different brain regions, and iii) a person's height, which is determined by contributions from many different genetic and environmental factors. Science is, to a large extent, concerned with establishing the precise nature of the component processes responsible for a given set of measurements, whether these involve height, EEG signals, or even IQ.

Under certain conditions, the underlying sources of measured quantities can be recovered by making use of methods (PP and ICA) based on two intimately related assumptions. The more intuitively obvious of these assumptions is that different physical processes tend to generate signals that are statistically independent of each other. This suggests that one way to recover source signals from signal mixtures is to find transformations of those mixtures that produce independent signal components. This independence is given much emphasis in the ICA literature, although an apparently subsidiary assumption, that source signals have amplitude histograms that are non-Gaussian, is also required. In (apparent) contrast, the PP method relies on the assumption that any linear mixture of any set of (finite variance) source signals is Gaussian, and that the source signals themselves are not Gaussian. Thus, another method for extracting source signals from linear mixtures of those signals is to find transformations of the signal mixtures that extract non-Gaussian signals. It can be shown that the assumption of statistical independence is implicit in the assumption that source signals are non-Gaussian, and therefore that both PP and ICA are actually based on the same assumptions. Within the literature, PP is used to extract one signal at a time, whereas ICA extracts a set of signals simultaneously. However, like the apparently different assumptions of PP and ICA, this difference is superficial, and reflects the underlying histories of the two methods, rather than any fundamental difference between them.

Recent applications of ICA include separation of different speech signals [Bell and Sejnowski, 1995], analysis of EEG data [Makeig et al., 1997], functional magnetic resonance imaging (fMRI) data [McKeown et al., 1998], image processing [Bell and Sejnowski, 1997], and the relation between biological image processing and ICA [van Hateren and van der Schaaf, 1998].

2 Setting the Scene

Before becoming too embroiled in the intricacies of ICA, we need to establish the class of problems it can address. Given N time-varying source signals, we define the amplitudes of these signals at time t as a column vector s_t = (s_1t, ..., s_Nt)^T. These signals can be linearly combined to form a signal mixture x_t = a s_t, where each element of the row vector a specifies how much of the corresponding source signal s_it contributes to the signal mixture x_t. Given M signal mixtures x_t = (x_1t, ..., x_Mt)^T, we can define a mixing matrix A = (a_1, ..., a_M)^T in which each row a_i specifies a unique mixture x_it of the signals s_t = (s_1t, ..., s_Nt)^T. (Note that the subscript t denotes time, whereas the superscript T denotes the transpose operator.)

Using this matrix notation, the formation of M signal mixtures from N source signals can be written as:

x_t = A s_t    (1)

Both ICA and PP are capable of taking the signal mixtures x and recovering the sources s. That the mixtures can be separated in principle is easily demonstrated now that the problem has been summarised in matrix algebra. An `unmixing' matrix W is defined such that:

s_t = W x_t    (2)

Given that each row in W specifies how the mixtures in x are recombined to produce one source signal, it follows that it must be possible to recover one signal at a time by using a different row vector to extract each signal. For example, if only one signal is to be extracted from M signal mixtures then W is a 1 x M matrix. Thus, the shape of the unmixing matrix W depends upon how many signals are to be extracted. Usually, ICA is used to extract a number of sources simultaneously, whereas PP is used to extract one source at a time.
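To make Equations (1) and (2) concrete, the following minimal NumPy sketch mixes two arbitrary source signals with an assumed 2 x 2 mixing matrix and then recovers them exactly by using W = A^{-1} as the unmixing matrix. In practice A is unknown, and the whole point of PP and ICA is to estimate W from the mixtures alone.

    import numpy as np

    t = np.linspace(0, 1, 500)

    # Two source signals ("voices"), one per row of s.
    s = np.vstack([np.sin(2 * np.pi * 9 * t),             # source s1
                   np.sign(np.sin(2 * np.pi * 4 * t))])   # source s2

    A = np.array([[1.0, 0.8],    # hypothetical mixing matrix: row i gives the weights
                  [0.3, 1.0]])   # with which microphone i hears the two voices
    x = A @ s                    # Equation (1): each row of x is one signal mixture

    W = np.linalg.inv(A)         # an unmixing matrix that undoes this particular mixing
    y = W @ x                    # Equation (2): the recovered signals

    print(np.allclose(y, s))     # True: each row of y is one of the original sources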

3 The Geometry of Source Separation

The nature of the linear `unmixing' transformation W can be conveniently explored in terms of two signal mixtures. Consider two signals s = (s_1, s_2) that have been mixed with a 2 x 2 mixing matrix A to produce two signal mixtures x = A s, where x = (x_1, x_2). If we interpret this in terms of two voices (sources) and two microphones, then the element a_ij of A specifies the proximity of the jth voice to the ith microphone. Each microphone records a weighted mixture x_i of the two sources s_1 and s_2, where the weightings for microphone i are given by the ith row of A. Plots of s_1 versus s_2, and of x_1 versus x_2, can be seen in Figure 1.

We define the space in which s exists as S, with axes S_1 and S_2, and the space in which x exists as X, with axes X_1 and X_2. The amplitudes of the source signals s_1t and s_2t at time t are represented as a point with coordinates (s_1t, s_2t) in S. The corresponding amplitudes of the signal mixtures x_t = A s_t at time t are represented as a point x_t with coordinates (x_1t, x_2t) in X. The mixing matrix A defines a linear transformation, so that the mapping from S to X consists of a rotation and shearing of the axes of S. Thus, the orthogonal axes S_1 and S_2 in S appear as two skewed lines S_1' and S_2' in X (see Figure 1b); these transformed axes lie along the columns of A, which form a basis for X. Note that variation along each axis in S is caused by variation in the amplitude of one source signal. Given that each axis S_i in S corresponds to a direction S_i' in X, variation along the projected axes S_1' and S_2' in X is caused by variation in the signal amplitudes s_1 and s_2, respectively. If we can extract variations associated with one direction, say S_1', in X whilst ignoring variations along all other directions, then we can recover the amplitude of the signal s_1. This can be achieved by projecting all points in X onto a line that is orthogonal to all but one direction, S_1'. Such a line is defined by a vector w = (w_1, w_2) (depicted as a dashed line in Figure 1b), chosen so that only components of x that lie along the direction S_1' are transformed to non-zero values of y = w x. This is depicted graphically in Figure 1b, with the result of unmixing both signals, y = W x, depicted in Figure 2.

To summarise, the linear transformation y_t = w x_t produces a scalar value for each point x_t in X, so that a single signal results from the transformation y = w x. The signal amplitude y_t at time t is found by taking the inner product of w with the point x_t. As the row vector w is defined to be orthogonal to the directions corresponding to all but one source signal in X, only that signal is projected to non-zero values of y = w x. Having demonstrated that an unmixing matrix W exists that can extract one or more source signals from a set of mixtures, the following sections describe how PP and ICA can be used to obtain values for W.

4 Independence and Moments of Non-Gaussian Signals

4.1 Independence and Correlation

Statistical independence lies at the core of the ICA/PP methods. Therefore, in order to understand ICA/PP, it is essential to understand independence. At an intuitive level, if two variables x and y are independent then the value of one variable cannot be predicted from the value of the other. One simple way to understand independence relies on the more familiar definition of correlation. The correlation between two variables x and y is:

rho(x, y) = Cov(x, y) / (sigma_x sigma_y)    (3)

where sigma_x and sigma_y are the standard deviations of x and y, respectively, and Cov(x, y) is the covariance between x and y:

Cov(x, y) = (1/n) SUM_i (x_i - xbar)(y_i - ybar)    (4)

where xbar and ybar are the means of x and y, respectively. Correlation is simply a form of covariance that has been normalised to lie in the range [-1, +1]. Note that if two variables x and y are uncorrelated then rho(x, y) = Cov(x, y) = 0, although rho(x, y) and Cov(x, y) are not equal in general. The covariance Cov(x, y) can be shown to be:

Cov(x, y) = (1/n) SUM_i x_i y_i - [(1/n) SUM_i x_i][(1/n) SUM_i y_i]    (5)

Each term in Equation (5) is a mean, or expected value E, so the covariance can be written more succinctly as:

Cov(x, y) = E[xy] - E[x]E[y]    (6)
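The following sketch, using NumPy and an arbitrary pair of partially dependent variables, simply checks that the sample versions of Equations (3)-(6) agree with one another; the particular construction of y from x is an assumption made for illustration only.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100000
    x = rng.standard_normal(n)
    y = 0.5 * x + rng.standard_normal(n)     # y depends partly on x

    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))      # Equation (4)
    cov_alt = np.mean(x * y) - np.mean(x) * np.mean(y)     # Equation (6)
    rho = cov_xy / (x.std() * y.std())                     # Equation (3)

    print(round(cov_xy, 3), round(cov_alt, 3))   # the two forms of the covariance agree
    print(round(rho, 3))                         # approximately 0.45 for this construction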

A histogram plot with abscissas x and y, and with the ordinate denoting frequency, approximates the probability density function (pdf) of the joint distribution of x and y. The quantity E[xy] is known as a second moment of this joint distribution. Similarly, histograms of x and of y approximate their respective pdfs, which are known as the marginal distributions of the joint distribution. The quantities E[x] and E[y] are the first moments of these marginal distributions. Thus, covariance is defined in terms of moments associated with the joint distribution of x and y.

The fact that x and y are uncorrelated does not imply that they are independent. To take a simple example, given a variable z that takes values between 0 and 2*pi, we can define x = sin(z) and y = cos(z). Intuitively, it can be seen that both x and y depend on z. As can be seen from Figure 3, the variables x and y are highly interdependent. However, the covariance (and therefore the correlation) of x and y is zero:

Cov(x, y) = E[xy] - E[x]E[y]           (7)
          = E[sin(z) cos(z)] - 0       (8)
          = 0                          (9)

In summary, covariance does not capture all types of dependencies between x and y, whereas measures of statistical independence do. Like covariance, independence is defined in terms of the expected values of the joint distribution of x and y. We have established that if x and y are uncorrelated then they have zero covariance:

E[xy] - E[x]E[y] = 0    (10)

Using a generalised form of covariance involving powers of x and y, if x and y are statistically independent then:

E[x^p y^q] - E[x^p]E[y^q] = 0    (11)

for all positive integer values of p and q. Whereas covariance uses p = q = 1, all positive integer values of p and q are implicit in measures of independence. Formally, if x and y are independent then each moment E[x^p y^q] is equal to the product of the corresponding moments of the marginal distributions, E[x^p]E[y^q], which leads to the result stated in Equation (11).

The formal similarity between measures of independence and covariance can be interpreted as follows. Whereas covariance measures the amount of linear covariation between x and y, independence measures the linear covariation between x raised to the power p and y raised to the power q. Thus, independence can be considered as a generalised form of covariance, which measures the linear covariation between non-linear functions (e.g. the cubed power) of two variables. For example, using x = sin(z) and y = cos(z) we know that Cov(x, y) = 0. However, the measure of linear covariation between the variables x^p and y^q, as depicted in Figure 3 for p = q = 2, is:

E[x^p y^q] - E[x^p]E[y^q] = -0.123    (12)

This corresponds to a correlation between x^2 and y^2 of -0.864 (see Figure 3). Thus, whereas the correlation between x = sin(z) and y = cos(z) is zero, the fact that the value of x can be predicted from y is implicit in the non-zero values of the higher order moments of the joint distribution of x and y.
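The sin/cos example can be checked numerically. In the sketch below z is sampled uniformly between 0 and 2*pi; because the exact values depend on how z is sampled (and on any noise added for display, as in Figure 3), the numbers differ slightly from the -0.123 and -0.864 quoted above, but the qualitative result is the same: the ordinary correlation is zero while the p = q = 2 moment of Equation (11) is clearly non-zero.

    import numpy as np

    z = np.linspace(0, 2 * np.pi, 100000)
    x, y = np.sin(z), np.cos(z)

    corr_xy = np.corrcoef(x, y)[0, 1]                                  # ordinary correlation
    hi_moment = np.mean(x**2 * y**2) - np.mean(x**2) * np.mean(y**2)   # p = q = 2 in Eq. (11)

    print(round(corr_xy, 3))      # ~0: x and y are uncorrelated
    print(round(hi_moment, 3))    # ~ -0.125: the dependence appears in higher moments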

4.2 Moments and non-Gaussian pdfs

We have shown that both the covariance and the interdependence between two variables x and y are defined in terms of the moments of the pdf of their joint distribution. However, a variable with a Gaussian pdf is special in the sense that its distribution is completely specified by its second moment E[xy]. That is, the values of all higher moments are implicit in the value of the second moment of a Gaussian distribution. Thus, if the covariance E[xy] - E[x]E[y] of the joint distribution of two Gaussian variables is zero then it can be shown that the quantity E[x^p y^q] - E[x^p]E[y^q] is zero for all positive integer values of p and q. From Equation (11) we know that such variables are statistically independent, and it therefore follows that uncorrelated Gaussian variables are also independent. However, non-Gaussian variables that are uncorrelated are not, in general, independent. As stated above, the non-Gaussian variables x = sin(z) and y = cos(z) provide an example: here E[xy] - E[x]E[y] = 0, but (for example) E[x^2 y^2] - E[x^2]E[y^2] = -0.123, and the correlation between x^2 and y^2 is r = -0.86. Thus, for non-Gaussian variables, the dependency between x and y only becomes apparent in their higher order moments.

5 Using Independence and Non-Gaussian Assumptions for Source Separation

5.1 Projection Pursuit: Mixtures of Source Signals Are Gaussian

A critical feature of a random linear mixture of any signals (with finite variance) is that a histogram of its values is approximately Gaussian; that is, it has a Gaussian probability density function (pdf). This follows from the central limit theorem, and is illustrated in Figures 5, 6 and 7. Most mixtures of a set of signals therefore produce a signal mixture with a Gaussian pdf. As methods for separating sources use a set of mixtures as input, and produce a linear weighting of them as output, it follows that arbitrary `unmixing' matrices W also produce Gaussian signals. However, if an `unmixing' matrix exists that produces a non-Gaussian signal from the set of mixtures then such a signal is unlikely to be a mixture of signals.
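The effect of the central limit theorem can be seen in a few lines of NumPy. The sketch below draws six non-Gaussian sources (three super-Gaussian, three sub-Gaussian; these particular distributions are arbitrary choices) and mixes them with a random matrix: the excess kurtosis (the non-Gaussianity index defined in Equation (14) below) of each mixture is typically pulled much closer to zero, i.e. closer to Gaussian, than that of the sources.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100000

    def excess_kurtosis(v):
        v = v - v.mean()
        return np.mean(v**4) / np.mean(v**2)**2 - 3   # zero for a Gaussian signal

    # Six non-Gaussian sources, standardised to unit variance.
    sources = np.vstack([rng.laplace(size=n) for _ in range(3)] +        # super-Gaussian
                        [rng.uniform(-1, 1, size=n) for _ in range(3)])  # sub-Gaussian
    sources = sources / sources.std(axis=1, keepdims=True)

    mixtures = rng.standard_normal((6, 6)) @ sources   # six random linear mixtures

    print([round(excess_kurtosis(s), 2) for s in sources])    # roughly 3 or -1.2
    print([round(excess_kurtosis(m), 2) for m in mixtures])   # all pulled towards zero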

If we assume that source signals have non-Gaussian pdfs then, whilst most transformations produce data with Gaussian distributions, a small number of transformations exist that produce data with non-Gaussian distributions. Under certain conditions, the non-Gaussian signals extracted from signal mixtures by such a transformation are in fact the original source signals. This is the basis of projection pursuit methods [Friedman, 1987].

In order to set about finding non-Gaussian component signals, it is necessary to define precisely what is meant by the term `non-Gaussian'. Two important classes of signals with non-Gaussian pdfs have super-Gaussian and sub-Gaussian pdfs. These are defined in terms of kurtosis:

k = [ (1/T) INT_T (s_t - sbar)^4 dt ] / [ (1/T) INT_T (s_t - sbar)^2 dt ]^2 - 3    (13)

where s_t is the value of the signal at time t, sbar is the mean value of s_t, and the constant 3 ensures that super-Gaussian signals have positive kurtosis, whereas sub-Gaussian signals have negative kurtosis. This can be written more succinctly in terms of expected values E[.]:

k = E[(s - sbar)^4] / (E[(s - sbar)^2])^2 - 3    (14)

A signal with a super-Gaussian pdf has most of its values clustered around zero, whereas a signal with a sub-Gaussian pdf does not. As examples, a speech signal has a super-Gaussian pdf, whereas a sine function and white noise have sub-Gaussian pdfs (see Figure 4).

PP methods tend to make use of high-order moments of distributions, such as kurtosis, in order to estimate the extent to which a signal is non-Gaussian. However, here we will use a more general measure, which is borrowed from ICA and depends on the following critical observation: if the scalar values of a signal s are transformed by the cumulative density function (cdf) Phi of that signal then the resultant distribution of values is uniform. This is useful because it permits the extent of deviation from a Gaussian pdf to be recast in terms of the uniformity, or equivalently the entropy, of the transformed signal Y = Phi(s) (one way to think of entropy is as a measure of the uniformity of a given distribution).

The question of how to find the linear transformation capable of recovering a source signal follows from the definition of this measure of deviation from Gaussianity. We have established that a linear transformation W exists such that a signal s = Wx can be recovered from a set of M signal mixtures x, and that this transformation produces a signal such that Y = Phi(Wx) has maximum entropy. By inverting the flow of logic in this argument, it follows that s can be recovered from x by finding a W that maximises the entropy H(Y) of Y = Phi(Wx).

It can be shown [Girolami and Fyfe, 1996] that if a number of signals are extracted from a set of mixtures x, and these are the most non-Gaussian component signals of the mixtures, then they are guaranteed to be mutually independent. Thus, even though a measure of independence is not explicitly maximised as part of the PP optimisation process, extracting non-Gaussian signals produces signals that are mutually independent. In contrast, ICA explicitly maximises the mutual independence of extracted signals.
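A minimal projection pursuit demonstration in NumPy follows. Two non-Gaussian sources (one sub-Gaussian, one super-Gaussian; arbitrary choices) are mixed, the mixtures are first whitened (decorrelated and scaled to unit variance, a standard preprocessing step that is assumed here but not discussed above) so that every candidate projection can be indexed by a single angle, and the projection whose output has the largest absolute kurtosis is retained. The extracted signal matches one of the original sources almost perfectly, up to sign and scale.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 20000

    # Two non-Gaussian sources, standardised to zero mean and unit variance.
    S = np.vstack([rng.uniform(-1, 1, n),     # sub-Gaussian
                   rng.laplace(0, 1, n)])     # super-Gaussian
    S = (S - S.mean(axis=1, keepdims=True)) / S.std(axis=1, keepdims=True)

    A = np.array([[1.0, 0.6],
                  [0.4, 1.0]])                # hypothetical mixing matrix
    X = A @ S                                 # two signal mixtures

    # Whiten the mixtures (assumed preprocessing step).
    d, E = np.linalg.eigh(np.cov(X))
    Z = (E @ np.diag(d ** -0.5) @ E.T) @ X

    def excess_kurtosis(v):
        v = v - v.mean()
        return np.mean(v**4) / np.mean(v**2)**2 - 3

    # Projection pursuit: scan unit-length projections and keep the direction whose
    # output is most non-Gaussian (largest absolute kurtosis).
    thetas = np.linspace(0, np.pi, 360)
    kurts = [excess_kurtosis(np.array([np.cos(t), np.sin(t)]) @ Z) for t in thetas]
    best = thetas[np.argmax(np.abs(kurts))]
    y = np.array([np.cos(best), np.sin(best)]) @ Z

    # One of these correlations is ~1: the extracted signal is one of the sources.
    print([round(abs(np.corrcoef(y, S[i])[0, 1]), 3) for i in range(2)])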

5.2 ICA: Source Signals Are Statistically Independent

The ICA methods described in this section are based on the following simple observation: if a set of N signals are from different physical sources (e.g. N different speakers) then they tend to be statistically independent of each other. The method of ICA is based on the assumption that if a set of independent signals can be extracted from a set of signal mixtures then these extracted signals are likely to be the original source signals. Like PP, ICA requires assumptions of independence that involve the cdfs of the source signals, and this is the link that binds the ICA and PP methods together.

As in the previous section, this problem can be considered in geometric terms. The amplitudes of N source signals at a given time can be represented as a point in an N-dimensional space and, considered over all times, they define a distribution of points in this space. If the signals are from different sources (e.g. N different speakers) then they tend to be statistically independent of each other. As with PP, a key observation is that if a signal s has cdf Phi then the distribution of Phi(s) has maximum entropy (i.e. is uniform). Similarly, if N signals each have cdf Phi then the joint distribution of Phi(s) = (Phi(s_1), ..., Phi(s_N))^T has maximum entropy, and is therefore uniform. For a set of signal mixtures x = As, an `unmixing' matrix W exists such that s = Wx. Given that Phi(s) has maximum entropy, it follows that s can be recovered by finding a matrix W that maximises the entropy of Y = Phi(Wx) (where Phi is a vector of cdfs in one-to-one correspondence with the transformed signals in y = Wx), at which point Phi(Wx) = Phi(s).

In summary, for any distribution x which is a mixture of N independent signals each with cdf Phi, there exists a linear unmixing transformation W, followed by a non-linear transformation Phi, such that the resultant distribution Y = Phi(Wx) has maximum entropy. This can be used to recover the original sources by defining a plausible cdf Phi, and then finding an unmixing matrix W that maximises the entropy of Y.

The explicit assumption of independence upon which ICA is based is less critical than the apparently subsidiary assumption regarding the non-Gaussian nature of the source signals. It can be shown [Girolami and Fyfe, 1996] that, given a set of mixtures of independent non-Gaussian signals, the sources can be extracted by finding component signals that have appropriate cdfs, and that these signals are independent. The converse is not true, in general, if the number of extracted signals is less than the number of independent signals in the set of signal mixtures. That is, simply finding a subset of independent signals in a set of mixtures of independent non-Gaussian source signals is not, in general, equivalent to finding the component sources. This is because linear combinations of disjoint sets of source signals are independent.

For example, if a subset of independent signals is combined to form a signal mixture x_1, and a non-overlapping subset of other signals is combined to form a mixture x_2, then x_1 and x_2 are mutually independent, even though both consist of mixtures of source signals. Thus, statistical independence of the extracted signals is a necessary, but not sufficient, condition for source separation.

Having established the connection between ICA and PP, and the conditions under which they are equivalent, we proceed by describing the `standard' ICA method [Bell and Sejnowski, 1995].

6 The Nuts and Bolts of ICA

Suppose that the outputs x = (x_1, ..., x_M)^T of M measurement devices are a linear mixture of N = M independent signal sources s = (s_1, ..., s_N)^T, so that x = As, where A is an N x N mixing matrix. We wish to find an N x N unmixing matrix W such that each of the N components recovered by y = Wx is one of the original signals s (i.e. K = N). As discussed above, an unmixing matrix W can be found by maximising the entropy H(Y) of the joint distribution Y = (Y_1, ..., Y_N) = (Phi_1(y_1), ..., Phi_N(y_N)), where y_i = (Wx)_i. The correct Phi_i have the same form as the cdfs of the source signals. However, in many cases it is sufficient to approximate these cdfs by sigmoids, Y_i = tanh(y_i). (In fact, sources s_i normalised so that E[s_i tanh s_i] = 1/2 can be separated using tanh sigmoids if and only if the pairwise conditions kappa_i kappa_j > 1 are satisfied, where kappa_i = 2E[s_i^2]E[sech^2 s_i] [Porrill, 1997].)

The entropy of a signal x with pdf f_x(x) is given by:

H(x) = -E[ln f_x(x)] = - INT f_x(x) ln f_x(x) dx    (15)

As might be expected, the transformation of a given data set x affects the entropy of the transformed data Y according to the change in the amount of `spread' introduced by the transformation. Given a multidimensional signal x, if a cluster of points in x is mapped to a large region in Y, then the transformation implicitly maps infinitesimal volumes from one space to another. The `volumetric mapping' between spaces is given by the Jacobian of the transformation between the spaces. The Jacobian combines the derivative of each component of Y with respect to every component of x to form a ratio of infinitesimal volumes in x and Y. The change in entropy induced by the transformation can be shown to be equal to the expected value of ln|J|, where |J| is the absolute value of the determinant of the Jacobian matrix J. Given that Y = Phi(Wx), the output entropy H(Y) is therefore related to the entropy of the input H(x) by

H(Y) = H(x) + E[ ln|J| ]    (16)

Note that the entropy of the input, H(x), is constant and is unaffected by W, so it can be ignored when maximising H(Y). Using the chain rule, J = (dY/dy)(dy/dx), where dY/dy and dy/dx are themselves Jacobian matrices, so that |J| can be evaluated as:

|J| = ( PROD_{i=1}^{N} Phi_i'(y_i) ) |W|    (17)

Substituting Equation (17) into Equation (16) yields

H(Y) = H(x) + E[ SUM_{i=1}^{N} ln Phi_i'(y_i) ] + ln|W|    (18)

The term E[ SUM_i ln Phi_i'(y_i) ] can be estimated given n samples from the distribution defined by y:

E[ SUM_{i=1}^{N} ln Phi_i'(y_i) ] ~= (1/n) SUM_{j=1}^{n} SUM_{i=1}^{N} ln Phi_i'(y_i^(j))    (19)

Ignoring H(x), and substituting Equation (19) into Equation (18), yields a new function that differs from H(Y) by a constant equal to H(x):

h(W) = (1/n) SUM_{j=1}^{n} SUM_{i=1}^{N} ln Phi_i'(y_i^(j)) + ln|W|    (20)

If we define the cdf as Phi_i = tanh then this evaluates to

h(W) = (1/n) SUM_{j=1}^{n} SUM_{i=1}^{N} ln(1 - tanh^2(y_i^(j))) + ln|W|    (21)

This function can be maximised by gradient ascent on its derivative with respect to the matrix W:

grad_W h = [W^T]^{-1} - 2 Y x^T    (22)

where Y = tanh(y). An unmixing matrix can now be found by taking small steps of size eta to update W:

Delta W = eta ( [W^T]^{-1} - 2 Y x^T )    (23)

In fact, the matching of the pdf of y to each cdf Phi also requires that each signal y_i has zero mean. This is easily accommodated by introducing a `bias' weight w_i, so that y_i = (Wx)_i + w_i has zero mean. The value of each bias weight is learned like any other weight in W. For a tanh cdf, the bias update is:

Delta w_i = -2 eta Y_i    (24)

In practice, h(W) is maximised either a) by using a `natural gradient' [Amari, 1998], which normalises the error surface so that the step size along each dimension is scaled by the local gradient in that direction, and which obviates the need to invert W at each step, or b) by a second-order technique (such as BFGS, conjugate gradients, or a Marquardt method) which estimates an optimal search direction and step size under the assumption that the error surface is locally quadratic.
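The update rule of Equation (23) can be implemented in a few lines of NumPy. The sketch below uses two super-Gaussian (Laplacian) sources and an arbitrary 2 x 2 mixing matrix, approximates the source cdfs by tanh as in Equation (21), and performs batch gradient ascent using the sample average of the gradient in Equation (22); centring the mixtures beforehand plays the role of the bias weights of Equation (24). The step size, the number of iterations and the sources themselves are illustrative assumptions rather than recommended settings.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 10000

    # Two super-Gaussian sources (speech-like pdfs) with unit variance.
    S = rng.laplace(0.0, 1.0 / np.sqrt(2.0), size=(2, n))
    A = np.array([[1.0, 0.5],
                  [0.6, 1.0]])                 # hypothetical mixing matrix
    X = A @ S
    X = X - X.mean(axis=1, keepdims=True)      # centring replaces the bias weights of Eq. (24)

    W = np.eye(2)                              # initial guess for the unmixing matrix
    eta = 0.05                                 # step size
    for _ in range(3000):
        Y = np.tanh(W @ X)                     # Y_i = tanh(y_i), the assumed cdf
        grad = np.linalg.inv(W.T) - (2.0 / n) * Y @ X.T   # sample average of Eq. (22)
        W = W + eta * grad                     # Eq. (23)

    y = W @ X                                  # recovered signals
    # Each recovered signal is (almost) perfectly correlated with one of the sources.
    print(np.abs(np.corrcoef(np.vstack([y, S]))[:2, 2:]).round(3))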

7 Spatial and Temporal ICA

In `standard' ICA, each of N signal mixtures is measured over T time steps, and N sources are recovered as y = Wx, where each source is independent of every other source over time. However, when using ICA to analyse temporal sequences of images, it rapidly becomes apparent that there are two alternative ways to implement ICA.

7.1 Temporal ICA

Normally, independent temporal sequences are extracted by placing the image corresponding to each time step in a column of x. We refer to this as ICAt. This essentially treats each of the N pixels as a separate `microphone' or mixture, so that each mixture consists of T time steps. The (large) N x N matrix W then finds temporally independent sources that contribute to each pixel grey-level over time. Having discovered the temporally independent signals for an image sequence, this begs the question: what was it that varied independently over time? Given y = Wx we can derive

x = A y    (25)

where A = W^{-1} is an N x N matrix. Therefore, each row (source signal) of y specifies how the contribution to x of one column (image) of A varies over time. So, whereas each row y_i of y specifies a signal that is independent of all other rows of y, each column a_i of A consists of an image that varies independently over time according to the amplitude of y_i. Note that, in general, the rows of y are constrained to be mutually independent, whereas the relationship between the columns of A is completely unconstrained.

7.2 Spatial ICA

Instead of placing each image in a column of x, we can place each image in a row of x. This is equivalent to treating each time step as a mixture of independent images. We refer to this as ICAs. In this case, each source signal (row of y) is an image, where the pixel values in each image (row) are independent of every other image, so that these images are said to be spatially independent. Each column of the T x T matrix A is a temporal sequence.

In summary, both ICAs and ICAt produce a set of images and a corresponding set of temporal sequences. However, ICAt produces a set of mutually independent temporal sequences and a corresponding set of unconstrained images, whereas ICAs produces mutually independent images and a corresponding set of unconstrained temporal sequences.
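The difference between the two arrangements is easiest to see with a toy image sequence. In the NumPy sketch below (the image size, number of frames and generative model are all arbitrary), a movie is built from two fixed spatial patterns, each modulated by its own time course; for ICAt the data matrix is the pixels-by-time array itself and W is N x N, whereas for ICAs the same array is simply transposed and W is T x T.

    import numpy as np

    rng = np.random.default_rng(4)
    N, T = 100, 30                     # N pixels per image, T time steps

    # Toy movie: two spatial patterns (columns of A), each with its own time course (rows of y).
    images = rng.standard_normal((N, 2))
    timecourses = rng.laplace(size=(2, T))
    x = images @ timecourses           # pixels x time; each column is one frame

    # Temporal ICA (ICAt): each pixel is a 'microphone', so the data matrix is x itself
    # (N mixtures, each of length T) and the unmixing matrix W would be N x N.
    print(x.shape, "-> W of shape", (N, N))

    # Spatial ICA (ICAs): each frame is a mixture of images, so the data matrix is x
    # transposed (T mixtures, each of length N) and W would be T x T.
    print(x.T.shape, "-> W of shape", (T, T))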

If it is known that either temporal or spatial independence cannot be assumed then this rules out ICAt or ICAs, respectively. In practice, ICAt is computationally expensive because it involves an N x N unmixing matrix W, where N is the (large) number of pixels, whereas ICAs requires only a T x T matrix W. ICAs has been used to good effect on fMRI images [McKeown et al., 1998]. If neither spatial nor temporal independence can be assumed then a form of ICA that requires assumptions of only minimal dependence over time and space can be used.

8 Using Principal Component Analysis to Preprocess Data

For many data sets, it is impracticable to find an N x N unmixing matrix W because the number N of rows in x is large. In such cases, principal component analysis (PCA) can be used to reduce the size of W. Each of the T N-dimensional column vectors in x defines a single point in an N-dimensional space. If most of these points lie in a K-dimensional subspace (where K << N) then we can use K judiciously chosen basis vectors to represent the T columns of x. (For example, if all the points in a box lie in a two-dimensional square then we can describe the points in terms of the two basis vectors defined by two sides of that square.) Such a set of K N-dimensional eigenvectors U can be obtained using PCA.

Just as ICA can be used with data vectors in the rows or columns of x, so PCA can be used to find a set V of K T-dimensional eigenvectors. More importantly (and momentarily setting K = N), the two sets of eigenvectors U and V are related to each other by a diagonal matrix:

x = U D V^T    (26)

where the diagonal elements of D contain the ordered eigenvalues associated with the corresponding eigenvectors in the columns of U and of V. This decomposition is produced by singular value decomposition (SVD). Note that each eigenvalue specifies the amount of data variance associated with the direction defined by the corresponding eigenvector in U and V. We can therefore discard eigenvectors with small eigenvalues, because these account for trivial variations in the data set. Setting K << N permits a more economical representation of x:

x ~= x~ = U~ D~ V~^T    (27)

Note that U~ is now an N x K matrix, D~ is a diagonal K x K matrix, and V~ is a T x K matrix.

As with ICAs and ICAt, these can be considered in temporal and spatial terms. If each column of x is an image of N pixels then each column of U is an eigenimage, and each column of V is an eigensequence. Given that we require a small unmixing matrix W, it is desirable to use V~^T instead of x for ICAt, and U~^T instead of x^T for ICAs. The basic method consists of performing ICA on V~^T or U~^T to obtain K ICs, and then using the relation x~ = U~ D~ V~^T to obtain the K corresponding columns of A.
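The following NumPy sketch illustrates the reduction of Equations (26) and (27) on synthetic data that, by construction, lie close to a K-dimensional subspace; the sizes and the noise level are arbitrary. The K largest singular values capture almost all of the variance, so the truncated product U~ D~ V~^T reproduces x up to an error of the order of the added noise.

    import numpy as np

    rng = np.random.default_rng(5)
    N, T, K = 2000, 50, 5              # N pixels per image, T images, K components kept

    # Data whose columns genuinely lie close to a K-dimensional subspace.
    x = rng.standard_normal((N, K)) @ rng.standard_normal((K, T))
    x = x + 0.01 * rng.standard_normal((N, T))

    U, d, Vt = np.linalg.svd(x, full_matrices=False)    # x = U D V^T, Equation (26)
    Ut, Dt, Vtk = U[:, :K], np.diag(d[:K]), Vt[:K, :]   # keep the K largest singular values
    x_tilde = Ut @ Dt @ Vtk                             # Equation (27)

    print(d[:K + 2].round(2))                            # discarded values are comparatively tiny
    print(round(float(np.abs(x - x_tilde).max()), 3))    # reconstruction error ~ noise level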

8.1 ICAt Using SVD

Replacing x with V~^T in y = Wx produces

y = W V~^T    (28)

where each row of the K x T matrix V~^T is an `eigensequence', and W is a K x K matrix. In this case, ICA recovers K mutually independent sequences, each of length T. The set of images corresponding to the K temporal ICs can be obtained as follows. Given

V~^T = A y = W^{-1} y    (29)

and

x~ = U~ D~ V~^T    (30)

we have

x~ = U~ D~ W^{-1} y    (31)
   = A y               (32)

from which it follows that

A = U~ D~ W^{-1}    (33)

where A is an N x K matrix in which each column is an image. Thus, we have extracted K independent T-dimensional sequences and their corresponding N-dimensional images using a K x K unmixing matrix W.
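The algebra of Equations (28)-(33) can be verified directly. In the sketch below, W is just an arbitrary invertible K x K matrix standing in for the unmixing matrix that ICA would actually estimate from V~^T (no ICA is run here); whatever W is, the images recovered as A = U~ D~ W^{-1} and the sequences y = W V~^T multiply back together to give the reduced data x~.

    import numpy as np

    rng = np.random.default_rng(6)
    N, T, K = 500, 40, 3               # N pixels, T time steps, K retained components
    x = rng.standard_normal((N, T))

    U, d, Vt = np.linalg.svd(x, full_matrices=False)
    Ut, Dt, Vtk = U[:, :K], np.diag(d[:K]), Vt[:K, :]   # reduced SVD, Equation (27)
    x_tilde = Ut @ Dt @ Vtk

    # Stand-in for the K x K unmixing matrix that ICA would estimate from V~^T.
    W = rng.standard_normal((K, K))

    y = W @ Vtk                        # Equation (28): K candidate temporal components
    A = Ut @ Dt @ np.linalg.inv(W)     # Equation (33): the corresponding images

    print(A.shape, y.shape)            # (N, K) images and (K, T) time courses
    print(np.allclose(x_tilde, A @ y)) # Equations (31)-(32): x~ = A y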

8.2 ICAs Using SVD

A similar method can be used to find K independent images and their corresponding time courses. Replacing x^T with U~^T in y = Wx^T produces

y = W U~^T    (34)

where each row of the K x N matrix U~^T is an `eigenimage', and W is a K x K matrix. In this case, ICA recovers K mutually independent images, each of length N. The set of time courses corresponding to the K spatial ICs can be obtained as follows. Given

U~^T = A y = W^{-1} y    (35)

and

x~^T = V~ D~ U~^T    (36)

we have

x~^T = V~ D~ W^{-1} y    (37)
     = A y               (38)

from which it follows that

A = V~ D~ W^{-1}    (39)

where A is a T x K matrix in which each column is a time course. Thus, we have extracted K independent N-dimensional images and their corresponding T-dimensional time courses using a K x K unmixing matrix W.

Note that using SVD in this manner requires an assumption that the ICs are not distributed amongst the smaller eigenvectors which are usually discarded. The validity of this assumption is by no means guaranteed.

9 Emulating Singular Value Decomposition

We have shown how ICA can be made tractable by using principal components (PCs) obtained from SVD. However, for large-dimensional data, U or V can be too large for most computers. For instance, if the data matrix x is an N x T matrix then U is N x T and V is T x T. If each column in x contains an image with N pixels then neither x nor U may be small enough to fit into the RAM available on a computer. It is possible to compute U, D and V in an iterative manner using techniques that do not rely on SVD. Throughout the following we assume that V is smaller than U. For convenience, we assume that x consists of one image per column, and that each column corresponds to one of T time steps. We proceed by finding V and D, from which U can be obtained by combining V with x.

The matrix V contains one eigensequence per column. It can be obtained from the temporal T x T covariance matrix of x, which is given by the matrix product:

C = x^T x    (40)

This covariance matrix is the starting point of many standard PCA algorithms. After PCA we have V and a corresponding set of ordered eigenvalues. The matrix D, which is normally obtained with SVD, can be constructed by setting each diagonal element to the square root of the corresponding eigenvalue. Given that:

x = U D V^T    (41)

it follows that:

U = x V D^{-1}    (42)

So, given V and D from a PCA of the covariance matrix of x, we can obtain the eigenimages U. Note that we can compute as many eigenimages as required by simply omitting the corresponding eigensequences and eigenvalues from V and D, respectively.
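The covariance-based route of Equations (40)-(42) is easy to check against a direct SVD. In the NumPy sketch below (sizes arbitrary), the eigenvectors of C = x^T x give the eigensequences V, the square roots of its eigenvalues give the diagonal of D, and the eigenimages then follow from U = x V D^{-1}; only the small T x T matrix C ever has to be formed.

    import numpy as np

    rng = np.random.default_rng(7)
    N, T = 1000, 8                     # N pixels per image, T time steps (images in columns)
    x = rng.standard_normal((N, T))

    C = x.T @ x                        # Equation (40): T x T, cheap to store

    evals, V = np.linalg.eigh(C)       # PCA of C: eigensequences and eigenvalues
    order = np.argsort(evals)[::-1]    # sort into descending order
    evals, V = evals[order], V[:, order]

    D = np.diag(np.sqrt(evals))        # singular values = square roots of the eigenvalues
    U = x @ V @ np.linalg.inv(D)       # Equation (42): eigenimages from x, V and D

    print(np.allclose(x, U @ D @ V.T))        # Equation (41) holds
    print(np.allclose(U.T @ U, np.eye(T)))    # the recovered eigenimages are orthonormal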

10 SVD in Relation to ICAs and ICAt

If we distribute the eigenvalues between U and V by multiplying each column of U and V by the square root of its corresponding diagonal element of D then we have:

x = U V^T    (43)

which has a similar form to the ICA decomposition:

x = A s    (44)

The main difference between SVD and ICA is as follows. Each matrix produced by SVD has orthogonal columns. That is, the variation in each column is uncorrelated with the variations in every other column within U and within V. In contrast, ICA produces two matrices with quite different properties. Rather than being uncorrelated, the rows of s are independent. This stringent requirement on the rows of s suggests that the columns of A cannot, in general, also be independent, and ICA actually places no constraints on the relationships between the columns of A.
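Equation (43) and the orthogonality of the SVD factors can be confirmed in a few lines; the data matrix below is random and its dimensions are arbitrary.

    import numpy as np

    rng = np.random.default_rng(8)
    x = rng.standard_normal((50, 20))

    U, d, Vt = np.linalg.svd(x, full_matrices=False)

    # Distribute the singular values between U and V (Equation 43).
    Up = U * np.sqrt(d)        # each column of U scaled by the root of its singular value
    Vp = Vt.T * np.sqrt(d)     # each column of V scaled likewise

    print(np.allclose(x, Up @ Vp.T))          # x = U V^T after the rescaling
    print(np.allclose(U.T @ U, np.eye(20)))   # the columns of U are orthonormal
    print(np.allclose(Vt @ Vt.T, np.eye(20))) # and so are the columns of V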

Acknowledgements

J Stone is supported by a Mathematical Biology Wellcome Fellowship (Grant number 44823).

References

[Amari, 1998] Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276.

[Bell and Sejnowski, 1995] Bell, A.J. and Sejnowski, T.J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7:1129-1159.

[Bell and Sejnowski, 1997] Bell, A.J. and Sejnowski, T.J. (1997). The `independent components' of natural scenes are edge filters. Vision Research, 37(23):3327-3338.

[Friedman, 1987] Friedman, J. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82(397):249-266.

[Girolami and Fyfe, 1996] Girolami, M. and Fyfe, C. (1996). Negentropy and kurtosis as projection pursuit indices provide generalised ICA algorithms. NIPS'96 Blind Signal Separation Workshop.

[Jutten and Herault, 1988] Jutten, C. and Herault, J. (1988). Independent component analysis versus PCA. In Proc. EUSIPCO, pages 643-646.

[Makeig et al., 1997] Makeig, S., Jung, T., Bell, A., Ghahremani, D., and Sejnowski, T. (1997). Blind separation of auditory event-related brain responses into independent components. Proc. Natl. Acad. Sci. USA, 94:10979-10984.

[McKeown et al., 1998] McKeown, M., Makeig, S., Brown, G., Jung, T., Kindermann, S., and Sejnowski, T. (1998). Spatially independent activity patterns in functional magnetic resonance imaging data during the Stroop color-naming task. Proceedings of the National Academy of Sciences USA (in press).

[Porrill, 1997] Porrill, J. (1997). Independent component analysis: Conditions for a local maximum. Technical Report 23, Psychology Department, Sheffield University, England.

[van Hateren and van der Schaaf, 1998] van Hateren, J. and van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proc. Royal Soc. London B, 265:359-366.

Figure 1: The geometry of source separation. a) Plot of signal s_1 versus s_2. Each point s_t in S represents the amplitudes of the source signals s_1t and s_2t at time t. These signals are plotted separately in Figure 2. b) Plot of signal mixture x_1 versus x_2. Each point x_t = A s_t in X represents the amplitudes of the signal mixtures x_1t and x_2t at time t. These signal mixtures are plotted separately in Figure 2. The orthogonal axes S_1 and S_2 in S (solid lines in Figure 1a) are transformed by the mixing matrix A to form the skewed axes S_1' and S_2' in X (solid lines in Figure 1b). An `unmixing' matrix W consists of two row vectors, each of which `selects' a direction associated with a different signal in X. The dashed line in Figure 1b specifies one row vector w of an `unmixing' matrix W, which is (in general) orthogonal to every transformed axis S_i' except one (S_1', in this case). Variations in signal amplitude associated with directions (such as S_2') that are orthogonal to w have no effect on the inner product y = w x. Therefore, y only reflects amplitude changes associated with the direction S_1', so that y = k s_1, where k is a constant that equals unity if S_1' and w are co-linear.

Figure 2: Separation of two signals. The original signals s = (s_1, s_2) are displayed in the left-hand graphs. Two signal mixtures x = As are displayed in the middle graphs. The results of applying an unmixing matrix W = A^{-1} to the mixtures, y = Wx, are displayed in the right-hand graphs.

Figure 3: The interdependence of x = sin(z) and y = cos(z) is only apparent in the higher order moments of the joint distribution of x and y. a) Plot of x = sin(z) versus y = cos(z). Even though the value of x is highly predictable given the corresponding value of y (and vice versa), the correlation between x and y is r = 0. For display purposes, noise has been added in order to make the set of points visible. b) Plot of sin^2(z) versus cos^2(z). The correlation between sin^2(z) and cos^2(z) is r = -0.864. Whereas x and y are uncorrelated if the correlation between x and y is zero, they are statistically independent only if the correlation between x^p and y^q is zero for all positive integer values of p and q. Therefore, sin(z) and cos(z) are uncorrelated, but not independent.

Figure 4: Histograms of signals with different probability density functions. From left to right: histograms of a super-Gaussian, a Gaussian, and a sub-Gaussian signal. The left-hand histogram is derived from a portion of Handel's Messiah, the middle histogram is derived from Gaussian noise, and the right-hand histogram is derived from a sine wave.

Figure 5: Six sound signals and their pdfs. Each signal consists of ten thousand samples. From top to bottom: chirping, gong, Handel's Messiah, people laughing, whistle-plop, steam train.

Figure 6: The outputs of six microphones, each of which receives input from six sound sources according to its proximity to each source. Each microphone receives a different mixture of the six non-Gaussian signals displayed in Figure 5. Note that the pdf of each signal mixture, shown on the right-hand side, is approximately Gaussian.

Figure 7: A typical signal produced by applying a random `unmixing' matrix to the six signal mixtures displayed in Figure 6. The resultant signal has a pdf that is approximately Gaussian. From top to bottom: a single mixture of the six signals shown in Figure 5, the mixture's pdf, and the pdf of a Gaussian signal. Note that the correct unmixing matrix would produce each of the original source signals displayed in Figure 5.


More information

ICA. Independent Component Analysis. Zakariás Mátyás

ICA. Independent Component Analysis. Zakariás Mátyás ICA Independent Component Analysis Zakariás Mátyás Contents Definitions Introduction History Algorithms Code Uses of ICA Definitions ICA Miture Separation Signals typical signals Multivariate statistics

More information

The Hilbert Space of Random Variables

The Hilbert Space of Random Variables The Hilbert Space of Random Variables Electrical Engineering 126 (UC Berkeley) Spring 2018 1 Outline Fix a probability space and consider the set H := {X : X is a real-valued random variable with E[X 2

More information

Natural Gradient Learning for Over- and Under-Complete Bases in ICA

Natural Gradient Learning for Over- and Under-Complete Bases in ICA NOTE Communicated by Jean-François Cardoso Natural Gradient Learning for Over- and Under-Complete Bases in ICA Shun-ichi Amari RIKEN Brain Science Institute, Wako-shi, Hirosawa, Saitama 351-01, Japan Independent

More information

Lecture 2: Review of Prerequisites. Table of contents

Lecture 2: Review of Prerequisites. Table of contents Math 348 Fall 217 Lecture 2: Review of Prerequisites Disclaimer. As we have a textbook, this lecture note is for guidance and supplement only. It should not be relied on when preparing for exams. In this

More information

Joint Probability Distributions and Random Samples (Devore Chapter Five)

Joint Probability Distributions and Random Samples (Devore Chapter Five) Joint Probability Distributions and Random Samples (Devore Chapter Five) 1016-345-01: Probability and Statistics for Engineers Spring 2013 Contents 1 Joint Probability Distributions 2 1.1 Two Discrete

More information

Massoud BABAIE-ZADEH. Blind Source Separation (BSS) and Independent Componen Analysis (ICA) p.1/39

Massoud BABAIE-ZADEH. Blind Source Separation (BSS) and Independent Componen Analysis (ICA) p.1/39 Blind Source Separation (BSS) and Independent Componen Analysis (ICA) Massoud BABAIE-ZADEH Blind Source Separation (BSS) and Independent Componen Analysis (ICA) p.1/39 Outline Part I Part II Introduction

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Repeated Eigenvalues and Symmetric Matrices

Repeated Eigenvalues and Symmetric Matrices Repeated Eigenvalues and Symmetric Matrices. Introduction In this Section we further develop the theory of eigenvalues and eigenvectors in two distinct directions. Firstly we look at matrices where one

More information

On Information Maximization and Blind Signal Deconvolution

On Information Maximization and Blind Signal Deconvolution On Information Maximization and Blind Signal Deconvolution A Röbel Technical University of Berlin, Institute of Communication Sciences email: roebel@kgwtu-berlinde Abstract: In the following paper we investigate

More information

Chapter 15 - BLIND SOURCE SEPARATION:

Chapter 15 - BLIND SOURCE SEPARATION: HST-582J/6.555J/16.456J Biomedical Signal and Image Processing Spr ing 2005 Chapter 15 - BLIND SOURCE SEPARATION: Principal & Independent Component Analysis c G.D. Clifford 2005 Introduction In this chapter

More information

CPSC 340: Machine Learning and Data Mining. Sparse Matrix Factorization Fall 2018

CPSC 340: Machine Learning and Data Mining. Sparse Matrix Factorization Fall 2018 CPSC 340: Machine Learning and Data Mining Sparse Matrix Factorization Fall 2018 Last Time: PCA with Orthogonal/Sequential Basis When k = 1, PCA has a scaling problem. When k > 1, have scaling, rotation,

More information

Continuous Random Variables

Continuous Random Variables 1 / 24 Continuous Random Variables Saravanan Vijayakumaran sarva@ee.iitb.ac.in Department of Electrical Engineering Indian Institute of Technology Bombay February 27, 2013 2 / 24 Continuous Random Variables

More information

Lecture 7: Con3nuous Latent Variable Models

Lecture 7: Con3nuous Latent Variable Models CSC2515 Fall 2015 Introduc3on to Machine Learning Lecture 7: Con3nuous Latent Variable Models All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/

More information

Properties of Matrices and Operations on Matrices

Properties of Matrices and Operations on Matrices Properties of Matrices and Operations on Matrices A common data structure for statistical analysis is a rectangular array or matris. Rows represent individual observational units, or just observations,

More information

Introduction to Independent Component Analysis. Jingmei Lu and Xixi Lu. Abstract

Introduction to Independent Component Analysis. Jingmei Lu and Xixi Lu. Abstract Final Project 2//25 Introduction to Independent Component Analysis Abstract Independent Component Analysis (ICA) can be used to solve blind signal separation problem. In this article, we introduce definition

More information

Review of Probability Theory

Review of Probability Theory Review of Probability Theory Arian Maleki and Tom Do Stanford University Probability theory is the study of uncertainty Through this class, we will be relying on concepts from probability theory for deriving

More information

Different Estimation Methods for the Basic Independent Component Analysis Model

Different Estimation Methods for the Basic Independent Component Analysis Model Washington University in St. Louis Washington University Open Scholarship Arts & Sciences Electronic Theses and Dissertations Arts & Sciences Winter 12-2018 Different Estimation Methods for the Basic Independent

More information

Gaussian random variables inr n

Gaussian random variables inr n Gaussian vectors Lecture 5 Gaussian random variables inr n One-dimensional case One-dimensional Gaussian density with mean and standard deviation (called N, ): fx x exp. Proposition If X N,, then ax b

More information

Independent Component Analysis (ICA)

Independent Component Analysis (ICA) Independent Component Analysis (ICA) Université catholique de Louvain (Belgium) Machine Learning Group http://www.dice.ucl ucl.ac.be/.ac.be/mlg/ 1 Overview Uncorrelation vs Independence Blind source separation

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Revision: Probability and Linear Algebra Week 1, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Revision: Probability and Linear Algebra Week 1, Lecture 2 MA 575 Linear Models: Cedric E Ginestet, Boston University Revision: Probability and Linear Algebra Week 1, Lecture 2 1 Revision: Probability Theory 11 Random Variables A real-valued random variable is

More information

Non-Euclidean Independent Component Analysis and Oja's Learning

Non-Euclidean Independent Component Analysis and Oja's Learning Non-Euclidean Independent Component Analysis and Oja's Learning M. Lange 1, M. Biehl 2, and T. Villmann 1 1- University of Appl. Sciences Mittweida - Dept. of Mathematics Mittweida, Saxonia - Germany 2-

More information

Linear Algebra and Robot Modeling

Linear Algebra and Robot Modeling Linear Algebra and Robot Modeling Nathan Ratliff Abstract Linear algebra is fundamental to robot modeling, control, and optimization. This document reviews some of the basic kinematic equations and uses

More information

1 Principal Components Analysis

1 Principal Components Analysis Lecture 3 and 4 Sept. 18 and Sept.20-2006 Data Visualization STAT 442 / 890, CM 462 Lecture: Ali Ghodsi 1 Principal Components Analysis Principal components analysis (PCA) is a very popular technique for

More information

MATH 829: Introduction to Data Mining and Analysis Principal component analysis

MATH 829: Introduction to Data Mining and Analysis Principal component analysis 1/11 MATH 829: Introduction to Data Mining and Analysis Principal component analysis Dominique Guillot Departments of Mathematical Sciences University of Delaware April 4, 2016 Motivation 2/11 High-dimensional

More information

Dimensionality Reduction

Dimensionality Reduction Lecture 5 1 Outline 1. Overview a) What is? b) Why? 2. Principal Component Analysis (PCA) a) Objectives b) Explaining variability c) SVD 3. Related approaches a) ICA b) Autoencoders 2 Example 1: Sportsball

More information

Probabilistic & Unsupervised Learning

Probabilistic & Unsupervised Learning Probabilistic & Unsupervised Learning Week 2: Latent Variable Models Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College

More information

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396 Data Mining Linear & nonlinear classifiers Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 31 Table of contents 1 Introduction

More information

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26 Principal Component Analysis Brett Bernstein CDS at NYU April 25, 2017 Brett Bernstein (CDS at NYU) Lecture 13 April 25, 2017 1 / 26 Initial Question Intro Question Question Let S R n n be symmetric. 1

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

LECTURE :ICA. Rita Osadchy. Based on Lecture Notes by A. Ng

LECTURE :ICA. Rita Osadchy. Based on Lecture Notes by A. Ng LECURE :ICA Rita Osadchy Based on Lecture Notes by A. Ng Cocktail Party Person 1 2 s 1 Mike 2 s 3 Person 3 1 Mike 1 s 2 Person 2 3 Mike 3 microphone signals are mied speech signals 1 2 3 ( t) ( t) ( t)

More information

PROPERTIES OF THE EMPIRICAL CHARACTERISTIC FUNCTION AND ITS APPLICATION TO TESTING FOR INDEPENDENCE. Noboru Murata

PROPERTIES OF THE EMPIRICAL CHARACTERISTIC FUNCTION AND ITS APPLICATION TO TESTING FOR INDEPENDENCE. Noboru Murata ' / PROPERTIES OF THE EMPIRICAL CHARACTERISTIC FUNCTION AND ITS APPLICATION TO TESTING FOR INDEPENDENCE Noboru Murata Waseda University Department of Electrical Electronics and Computer Engineering 3--

More information

Notes on Latent Semantic Analysis

Notes on Latent Semantic Analysis Notes on Latent Semantic Analysis Costas Boulis 1 Introduction One of the most fundamental problems of information retrieval (IR) is to find all documents (and nothing but those) that are semantically

More information

Maximum variance formulation

Maximum variance formulation 12.1. Principal Component Analysis 561 Figure 12.2 Principal component analysis seeks a space of lower dimensionality, known as the principal subspace and denoted by the magenta line, such that the orthogonal

More information

ADAPTIVE LATERAL INHIBITION FOR NON-NEGATIVE ICA. Mark Plumbley

ADAPTIVE LATERAL INHIBITION FOR NON-NEGATIVE ICA. Mark Plumbley Submitteed to the International Conference on Independent Component Analysis and Blind Signal Separation (ICA2) ADAPTIVE LATERAL INHIBITION FOR NON-NEGATIVE ICA Mark Plumbley Audio & Music Lab Department

More information

Least Squares Optimization

Least Squares Optimization Least Squares Optimization The following is a brief review of least squares optimization and constrained optimization techniques. Broadly, these techniques can be used in data analysis and visualization

More information

Jim Lambers MAT 610 Summer Session Lecture 1 Notes

Jim Lambers MAT 610 Summer Session Lecture 1 Notes Jim Lambers MAT 60 Summer Session 2009-0 Lecture Notes Introduction This course is about numerical linear algebra, which is the study of the approximate solution of fundamental problems from linear algebra

More information

System 1 (last lecture) : limited to rigidly structured shapes. System 2 : recognition of a class of varying shapes. Need to:

System 1 (last lecture) : limited to rigidly structured shapes. System 2 : recognition of a class of varying shapes. Need to: System 2 : Modelling & Recognising Modelling and Recognising Classes of Classes of Shapes Shape : PDM & PCA All the same shape? System 1 (last lecture) : limited to rigidly structured shapes System 2 :

More information

MTTS1 Dimensionality Reduction and Visualization Spring 2014 Jaakko Peltonen

MTTS1 Dimensionality Reduction and Visualization Spring 2014 Jaakko Peltonen MTTS1 Dimensionality Reduction and Visualization Spring 2014 Jaakko Peltonen Lecture 3: Linear feature extraction Feature extraction feature extraction: (more general) transform the original to (k < d).

More information

L26: Advanced dimensionality reduction

L26: Advanced dimensionality reduction L26: Advanced dimensionality reduction The snapshot CA approach Oriented rincipal Components Analysis Non-linear dimensionality reduction (manifold learning) ISOMA Locally Linear Embedding CSCE 666 attern

More information

Separation of Different Voices in Speech using Fast Ica Algorithm

Separation of Different Voices in Speech using Fast Ica Algorithm Volume-6, Issue-6, November-December 2016 International Journal of Engineering and Management Research Page Number: 364-368 Separation of Different Voices in Speech using Fast Ica Algorithm Dr. T.V.P Sundararajan

More information

Principle Components Analysis (PCA) Relationship Between a Linear Combination of Variables and Axes Rotation for PCA

Principle Components Analysis (PCA) Relationship Between a Linear Combination of Variables and Axes Rotation for PCA Principle Components Analysis (PCA) Relationship Between a Linear Combination of Variables and Axes Rotation for PCA Principle Components Analysis: Uses one group of variables (we will call this X) In

More information

Machine Learning for Signal Processing. Analysis. Class Nov Instructor: Bhiksha Raj. 8 Nov /18797

Machine Learning for Signal Processing. Analysis. Class Nov Instructor: Bhiksha Raj. 8 Nov /18797 11-755 Machine Learning for Signal Processing Independent Component Analysis Class 20. 8 Nov 2012 Instructor: Bhiksha Raj 8 Nov 2012 11755/18797 1 A brief review of basic probability Uncorrelated: Two

More information

Stat 5101 Notes: Algorithms (thru 2nd midterm)

Stat 5101 Notes: Algorithms (thru 2nd midterm) Stat 5101 Notes: Algorithms (thru 2nd midterm) Charles J. Geyer October 18, 2012 Contents 1 Calculating an Expectation or a Probability 2 1.1 From a PMF........................... 2 1.2 From a PDF...........................

More information