Optimization and Testing in Linear Non-Gaussian Component Analysis


Optimization and Testing in Linear Non-Gaussian Component Analysis

arXiv v2 [stat.ME], 29 Dec 2017

Ze Jin, Benjamin B. Risk, David S. Matteson

May 13, 2018

Abstract

Independent component analysis (ICA) decomposes multivariate data into mutually independent components (ICs). The ICA model is subject to a constraint that at most one of these components is Gaussian, which is required for model identifiability. Linear non-Gaussian component analysis (LNGCA) generalizes the ICA model to a linear latent factor model with any number of both non-Gaussian components (signals) and Gaussian components (noise), where observations are linear combinations of independent components. Although the individual Gaussian components are not identifiable, the Gaussian subspace is identifiable. We introduce an estimator along with its optimization approach in which non-Gaussian and Gaussian components are estimated simultaneously, maximizing the discrepancy of each non-Gaussian component from Gaussianity while minimizing the discrepancy of each Gaussian component from Gaussianity. When the number of non-Gaussian components is unknown, we develop a statistical test to determine it based on resampling and the discrepancy of the estimated components. Through a variety of simulation studies, we demonstrate the improvements of our estimator over competing estimators, and we illustrate the effectiveness of the test to determine the number of non-Gaussian components. Further, we apply our method to real data examples and demonstrate its practical value.

Key words: independent component analysis; multivariate analysis; hypothesis testing; subspace estimation; dimension reduction; projection pursuit

Research support from an NSF award (DMS), a Xerox PARC Faculty Research Award, and the Cornell University Atkinson Center for a Sustainable Future (AVF-2017).

1 Introduction

Independent component analysis (ICA) finds a representation of multivariate data based on mutually independent components (ICs). As an unsupervised learning method, ICA has been developed for applications including blind source separation, feature extraction, brain imaging, and many others. Hyvärinen et al. (2004) provided an overview of ICA approaches for measuring non-Gaussianity and estimating the ICs.

Let Y = (Y_1, ..., Y_p)^T ∈ R^p be a random vector of observations. Assume that Y has a nonsingular, continuous distribution F_Y, with E(Y_j) = 0 and Var(Y_j) < ∞, j = 1, ..., p. Let X = (X_1, ..., X_p)^T ∈ R^p be a random vector of latent components. Without loss of generality, X is assumed to be standardized such that E(X_j) = 0 and Var(X_j) = 1, j = 1, ..., p. A static linear latent factor model to estimate the components X from the observations Y is given by

Y = AX,    X = A^{-1} Y =: BY,

where A ∈ R^{p×p} is a constant, nonsingular mixing matrix, and B ∈ R^{p×p} is a constant, nonsingular unmixing matrix.

Pre-whitened random variables are uncorrelated and thus easier to work with from both practical and theoretical perspectives. Let Σ_Y = Cov(Y) be the covariance matrix of Y, and H = Σ_Y^{-1/2} be an uncorrelating matrix. Let Z = HY = (Z_1, ..., Z_p)^T ∈ R^p be a random vector of uncorrelated observations, such that Σ_Z = Cov(Z) = I_p, the p × p identity matrix.
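As a concrete illustration of the pre-whitening step above, the following minimal R sketch (our own helper, not code from the paper; the function name whiten_data is hypothetical) centers a sample and multiplies it by an inverse square root of the sample covariance obtained from its eigendecomposition.

  # Minimal pre-whitening sketch (hypothetical helper, not the authors' code).
  # Y: n x p data matrix; returns Z whose sample covariance is (approximately) I_p.
  whiten_data <- function(Y) {
    Yc <- scale(Y, center = TRUE, scale = FALSE)      # remove column means
    e  <- eigen(cov(Yc), symmetric = TRUE)            # sample covariance and its spectrum
    H  <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)  # Sigma^{-1/2}
    list(Z = Yc %*% t(H), H = H)                      # Z = Y H^T, uncorrelated observations
  }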

The ICA model further assumes that the components X_1, ..., X_p are mutually independent, with at most one Gaussian component. The relationship between X and Z in the ICA model is then

X = A^{-1} Y = A^{-1} H^{-1} Z =: WZ = M^T Z,    Z = W^{-1} X = HAX =: MX = W^T X,    (1)

where W = A^{-1} H^{-1} ∈ R^{p×p} is a constant, nonsingular unmixing matrix, and M = HA ∈ R^{p×p} is a constant, nonsingular mixing matrix. Given that Z are uncorrelated observations, W is an orthogonal matrix, and M is an orthogonal matrix as well. Thus, we have W = M^{-1} = M^T and M = W^{-1} = W^T.

Many methods have been proposed for estimating the ICA model in the literature, including the fourth-moment diagonalization of FOBI (Cardoso, 1989) and JADE (Cardoso and Souloumiac, 1993), the information criterion of Infomax (Bell and Sejnowski, 1995), maximizing negentropy in FastICA (Hyvärinen and Oja, 1997), the maximum likelihood principle of ProDenICA (Hastie and Tibshirani, 2003), and the mutual dependence measures of dCovICA (Matteson and Tsay, 2017) and MDMICA (Jin and Matteson, 2017). Most of them use optimization to obtain components that have maximal non-Gaussianity under the constraint that they are uncorrelated. The goal is to use Z to estimate both W and X, by maximizing the non-Gaussianity of the components in X according to a particular measure of non-Gaussianity.

To overcome the limitation of the ICA model that at most one Gaussian component exists, the NGCA (non-Gaussian component analysis) model was proposed by Blanchard et al. (2006). Beginning with (1), the components X ∈ R^p are decomposed into signals S ∈ R^q and noise N ∈ R^{p−q}, and M is decomposed into M_S and M_N, and W into W_S and W_N correspondingly. The components in S are assumed to be non-Gaussian, while the components in N are assumed to be Gaussian. The NGCA model further assumes that the non-Gaussian components S are independent of the Gaussian components N, and that the components in N are mutually independent and thus multivariate normal, although the components in S may remain mutually dependent.

The relationship between X and Z in the NGCA model is then

X = [S; N] = WZ = [W_S Z; W_N Z],    Z = MX = [M_S  M_N] [S; N] = M_S S + M_N N,    (2)

where M_S ∈ R^{p×q} has rank q, M_N ∈ R^{p×(p−q)} has rank p − q, W_S ∈ R^{q×p} has rank q, and W_N ∈ R^{(p−q)×p} has rank p − q. The goal is to estimate the non-Gaussian subspace spanned by the rows of W_S corresponding to S, as the Gaussian subspace corresponding to N is uninteresting. Kawanabe et al. (2007) developed an improved algorithm based on radial kernel functions. Theis et al. (2011) proved a necessary and sufficient condition for the uniqueness of the non-Gaussian subspace from projection methods. Bean (2014) developed theory for an approach based on characteristic functions. Sasaki et al. (2016) introduced a least-squares NGCA (LSNGCA) algorithm based on least-squares estimation of log-density gradients and eigenvalue decomposition, and Shiino et al. (2016) proposed a whitening-free variant of LSNGCA. Nordhausen et al. (2017) developed asymptotic and bootstrap tests for the dimension of the non-Gaussian subspace based on the FOBI method.

To incorporate attractive characteristics of both the ICA model and the NGCA model, we consider the LNGCA (linear non-Gaussian component analysis) model proposed in Risk et al. (2017) as a special case of the NGCA model, which is the same as the NGICA model in Virta et al. (2016). In the form of (2), the LNGCA model further assumes that the components X_1, ..., X_p are mutually independent, and allows any number of both non-Gaussian components and Gaussian components among them. Similarly, we have W = M^{-1} = M^T and M = W^{-1} = W^T.

The relationship between X and Z in the LNGCA model is then

X = [S; N] = WZ = [W_S Z; W_N Z] = M^T Z = [M_S^T Z; M_N^T Z],    Z = MX = [M_S  M_N] [S; N] = M_S S + M_N N,

where M_S ∈ R^{p×q} has rank q, M_N ∈ R^{p×(p−q)} has rank p − q, W_S ∈ R^{q×p} has rank q, and W_N ∈ R^{(p−q)×p} has rank p − q. Risk et al. (2017) presented a parametric LNGCA using the logistic density and a semi-parametric LNGCA using tilted Gaussians with cubic B-splines to estimate this model. Virta et al. (2016) used projection pursuit to extract the non-Gaussian components and separate the corresponding signal and noise subspaces, where the projection index is a convex combination of squared third and fourth cumulants.

In this paper, we study the LNGCA model by taking advantage of its flexibility in the number of Gaussian components and of the mutual independence assumption between all components. With pre-whitening, the Gaussian contribution to the model likelihood is invariant to linear transformations that preserve unit variance, as shown in Risk et al. (2017). Thus, an alternative framework is necessary in order to leverage the information in the Gaussian subspace. This motivates our novel objective function, which estimates the unmixing matrix W by maximizing the discrepancy from Gaussianity for the non-Gaussian components and minimizing the discrepancy for the Gaussian components, thereby explicitly estimating the Gaussian subspace to improve upon constrained maximum likelihood approaches.

The rest of this paper is organized as follows. In Section 2, we introduce the discrepancy functions used to measure the distance from Gaussianity. In Section 3, we propose a framework for LNGCA estimation given the number of non-Gaussian components q. In Section 4, we introduce a sequence of statistical tests to determine the number of non-Gaussian components q when it is unknown.

We present the simulation results in Section 5, followed by real data examples in Section 6. Finally, Section 7 summarizes our work.

The following notation will be used throughout this paper. Let O^{a×b} denote the set of a × b matrices whose columns are orthonormal. Let P^±_{a×a} denote the set of a × a signed permutation matrices. Let ||U||_F = (Σ_{i,j} U_{ij}^2)^{1/2} denote the Frobenius norm of U ∈ R^{a×b}.

2 Discrepancy

2.1 Population Discrepancy Measures

In order to find the best estimate for the LNGCA model, we need a criterion to measure the discrepancy between X and its underlying assumption, i.e., S should be far from Gaussianity and N should be close to Gaussianity. Specifically, we choose a general class of functions D that measure the discrepancy D(X_j) between each component X_j and Gaussianity.

Hastie and Tibshirani (2003) proposed the expected log-likelihood tilt function to measure the discrepancy from Gaussianity in the estimation of the ICA model. Suppose the density of X_j is f_j, j = 1, ..., p, and each of the densities f_j is represented by an exponentially tilted Gaussian density

f_j(x_j) = φ(x_j) e^{g_j(x_j)},

where φ is the standard univariate Gaussian density, and g_j is a smooth function. The log-tilt function g_j represents departures from Gaussianity, and the expected log-likelihood ratio between f_j and the Gaussian density is

GPois(X_j) = E[g_j(X_j)].

Virta et al. (2015, 2016) proposed the use of the Jarque-Bera (JB) test statistic (Jarque and Bera, 1987),

JB(X_j) = Skew(X_j) + Kurt(X_j)/4,

to measure the discrepancy from Gaussianity in the estimation of ICA and LNGCA models, where

Skew(X_j) = (E[X_j^3])^2,    Kurt(X_j) = (E[X_j^4] − 3)^2

are the squared skewness and squared excess kurtosis. In fact, Virta et al. (2015, 2016) studied a linear combination of Skew and Kurt, i.e., α Skew + (1 − α) Kurt, and advised the choice α = 0.8, which corresponds to JB. This takes deviations in both skewness and kurtosis into account, while Skew and Kurt are valid discrepancy functions as well. Notice that JB(X_j), Skew(X_j), and Kurt(X_j) are simplified here because X_j is standardized.

2.2 Empirical Discrepancy Measures

Let Y = {Y^i = (Y_1^i, ..., Y_p^i) : i = 1, ..., n} ∈ R^{n×p} be an i.i.d. sample of observations from F_Y, and let Y_j = {Y_j^i : i = 1, ..., n} ∈ R^n be the corresponding i.i.d. sample of observations from F_{Y_j}, j = 1, ..., p, such that Y = [Y_1, ..., Y_p]. Let Σ̂_Y be the sample covariance matrix of Y, and Ĥ = Σ̂_Y^{-1/2} be the estimated uncorrelating matrix. Although the covariance Σ_Y is unknown in practice, the sample covariance Σ̂_Y is a consistent estimate under the finite second-moment assumption. Let Ẑ = YĤ^T ∈ R^{n×p} be the estimated uncorrelated observations, such that Σ̂_Ẑ = I_p, and Σ_Ẑ → I_p almost surely as n → ∞. To simplify notation, we assume below that an uncorrelated i.i.d. sample Z with mean zero and unit variance is given.

Let X = {X^i = (X_1^i, ..., X_p^i) : i = 1, ..., n} = [S, N] = ZW^T ∈ R^{n×p} be the sample of X, where S ∈ R^{n×q} and N ∈ R^{n×(p−q)}, and let X_j = {X_j^i : i = 1, ..., n} ∈ R^n be the sample of X_j, i.e., the jth column of X.

Similarly, we can define S_j, N_j ∈ R^n. Notice that each X_j, S_j, N_j has sample mean 0 and sample variance 1. We obtain the empirical discrepancy D̂ by replacing expectations with sample averages. The empirical GPois is given by

GPois(X_j) = (1/n) Σ_{i=1}^n ĝ_j(X_j^i),

where ĝ_j is estimated by maximum penalized likelihood, maximizing the criterion

Σ_{j=1}^p { (1/n) Σ_{i=1}^n [ log φ(X_j^i) + ĝ_j(X_j^i) ] − λ_j ∫ [ĝ_j''(x)]^2 dx }    subject to    ∫ φ(x) e^{ĝ_j(x)} dx = 1,

where the solution ĝ_j is a smoothing spline, and λ_j is selected by controlling the degrees of freedom of the smoothing spline, which is 6 by default in the R package ProDenICA (Hastie and Tibshirani, 2010).

The empirical JB is given by

JB(X_j) = Skew(X_j) + Kurt(X_j)/4,

where

Skew(X_j) = ( (1/n) Σ_{k=1}^n (X_j^k)^3 )^2,    Kurt(X_j) = ( (1/n) Σ_{k=1}^n (X_j^k)^4 − 3 )^2

are the empirical Skew and empirical Kurt. We will see that JB (joint use of skewness and kurtosis) performs much better than either Skew (use of skewness only) or Kurt (use of kurtosis only) alone in the simulations of Section 5, which was shown in Virta et al. (2016) as well.
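The empirical moment-based discrepancies above are simple sample averages; a minimal R sketch (our own helper names), assuming the input component has already been standardized to mean zero and unit variance, and using the Skew + Kurt/4 weighting implied by the α = 0.8 combination:

  # Empirical squared skewness, squared excess kurtosis, and JB discrepancy
  # for a standardized component x, following Section 2.2.
  skew_hat <- function(x) mean(x^3)^2
  kurt_hat <- function(x) (mean(x^4) - 3)^2
  jb_hat   <- function(x) skew_hat(x) + kurt_hat(x) / 4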

3 Optimization Strategy

Using the discrepancy D to measure the difference between X_j and Gaussianity, we seek an optimal W such that X is most likely to fit the underlying model with independent components. For the ICA model, a classical ICA estimator of W, used in FastICA (Hyvärinen and Oja, 1997) and ProDenICA (Hastie and Tibshirani, 2003), is defined by

Ŵ = arg max_{W ∈ O^{p×p}} Σ_{j=1}^p D(X_j).

We can naturally extend the ICA estimator to an LNGCA estimator given q as

Ŵ_S^max = arg max_{W ∈ O^{p×q}} Σ_{j: X_j ∈ S} D(X_j) = arg max_{W ∈ O^{p×q}} Σ_{j=1}^q D(S_j),    (3)

which is named the max estimator, as we maximize the discrepancy between the non-Gaussian components and Gaussianity. The algorithm for the max estimator is described in Algorithm 1, where the fixed point algorithm is elaborated in Hastie and Tibshirani (2003). The objective function used in Spline-LCA from Risk et al. (2017) is the same as that of the max estimator when the discrepancy is GPois, but the optimization differs, which will be explored in Section 5.

Given the estimated unmixing matrix Ŵ_S^max, the estimated non-Gaussian components are Ŝ = Z(Ŵ_S^max)^T. Since any rotation of a Gaussian distribution leads to the same Gaussian distribution, the Gaussian components N are not identifiable. However, we can benefit from estimating the Gaussian subspace in the LNGCA model, since the column space of W_N is identifiable.

Algorithm 1 LNGCA algorithm for the max estimator
1. Initialize W ∈ O^{p×q}.
2. Alternate until convergence of W, using the Frobenius norm:
   (a) Given W, estimate the discrepancy D(S_j) of component S_j for each j.
   (b) Given D(S_j), j = 1, ..., q, perform one step of the fixed point algorithm towards finding the optimal W.

Taking N into account by optimizing S and N simultaneously in the objective function, we expect to recognize the Gaussian subspace, which helps shape the non-Gaussian subspace because the non-Gaussian subspace is the complement of the Gaussian subspace. Motivated by this optimization idea, we propose a new LNGCA estimator given q as

Ŵ^max-min = arg max_{W ∈ O^{p×p}} [ Σ_{j: X_j ∈ S} D(X_j) − Σ_{j: X_j ∈ N} D(X_j) ] = arg max_{W ∈ O^{p×p}} [ Σ_{j=1}^q D(S_j) − Σ_{j=1}^{p−q} D(N_j) ],    (4)

which is named the max-min estimator for the LNGCA model, as we simultaneously maximize the discrepancy between the non-Gaussian components and Gaussianity and minimize the discrepancy between the Gaussian components and Gaussianity. The algorithm for the max-min estimator is described in Algorithm 2, where the fixed point algorithm is elaborated in Hastie and Tibshirani (2003). We will see that the max-min estimator (joint optimization of S and N) performs much better than the max estimator (optimization of S only) in the simulations of Section 5.

Algorithm 2 LNGCA algorithm for the max-min estimator
1. Initialize W ∈ O^{p×p}.
2. Alternate until convergence of W, using the Frobenius norm:
   (a) Given W, estimate the discrepancy D(X_j) of component X_j for each j.
   (b) Sort the components by D(X_j) in decreasing order.
   (c) Flip the sign of D(X_j) for the last p − q components.
   (d) Given D(X_j), j = 1, ..., p, perform one step of the fixed point algorithm towards finding the optimal W.
3. Sort the components by D(X_j) in decreasing order.
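To make the contrast between the objectives in (3) and (4) concrete, the following R sketch (our own, hypothetical helper; not the authors' implementation and without the fixed point updates) evaluates both objectives for a candidate orthogonal W, using any per-component discrepancy such as the jb_hat sketch above, and assuming q < p.

  # Evaluate the max objective (3) and the max-min objective (4) at a candidate W.
  # Z: n x p whitened data; W: p x p orthogonal matrix; q: number of signals;
  # disc: per-component empirical discrepancy function (e.g., jb_hat).
  lngca_objectives <- function(Z, W, q, disc) {
    X <- Z %*% t(W)                                 # candidate components X = Z W^T
    d <- sort(apply(X, 2, disc), decreasing = TRUE) # order components by discrepancy
    p <- length(d)
    c(max    = sum(d[1:q]),                         # signals only
      maxmin = sum(d[1:q]) - sum(d[(q + 1):p]))     # signals minus noise
  }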

Given the estimated unmixing matrix Ŵ^max-min, the estimated non-Gaussian and Gaussian components are X̂ = Z(Ŵ^max-min)^T. However, it is not immediately clear which components of X̂ belong to Ŝ and which to N̂, since Ŝ and N̂ are obtained together instead of Ŝ only. The solution is to sort the independent components X_1, ..., X_p by discrepancy value D(X_j) in decreasing order, and obtain the ordered independent components X_(1), ..., X_(p). Given that there are q non-Gaussian components, it is natural to take S = (X_(1), ..., X_(q))^T and N = (X_(q+1), ..., X_(p))^T based on the discrepancy function measuring non-Gaussianity. As the q non-Gaussian components in S have the q largest discrepancy values D among X_1, ..., X_p, the estimated non-Gaussian components in Ŝ are expected to have the q largest empirical discrepancy values among X̂_1, ..., X̂_p.

Nevertheless, we cannot sort X̂ by the empirical discrepancy once at the beginning to determine which components of X̂ belong to Ŝ or N̂, stick to that order throughout the iterative algorithm, and conclude which components belong to Ŝ or N̂ at the end, since the optimization depends on the initialization, and the order of the components may change after each iteration. Instead, we repeatedly sort X̂ by empirical discrepancy and adaptively determine the components in Ŝ and N̂ at the end of each iteration in Algorithm 2. Finally, when the algorithm converges, we sort the estimated components X̂_1, ..., X̂_p by empirical discrepancy, and obtain the ordered estimated components X̂_(1), ..., X̂_(p). Then we take Ŝ = [X̂_(1), ..., X̂_(q)] and N̂ = [X̂_(q+1), ..., X̂_(p)]. Accordingly, we decompose Ŵ into Ŵ_S and Ŵ_N, and M̂ = Ŵ^T into M̂_S and M̂_N.

4 Testing and Subspace Estimation

In practice, the number of non-Gaussian components q is unknown. Following the convention of ordering components with respect to non-Gaussianity, we introduce a sequence of statistical tests to decide q.

The main idea is that, for any j < j′, X̂_(j) is more likely to be non-Gaussian than X̂_(j′) in terms of its discrepancy value. If there are k non-Gaussian independent components, then X_(1), ..., X_(k) are non-Gaussian, and X_(k+1), ..., X_(p) are Gaussian. Based on this heuristic, we propose a sequence of hypotheses for searching q as

H_0^(k): X_(1), ..., X_(k−1) are non-Gaussian and X_(k), ..., X_(p) are Gaussian,
H_A^(k): X_(1), ..., X_(k) are non-Gaussian,

which is equivalent to testing whether there are exactly k − 1 non-Gaussian components or at least k non-Gaussian components.

Under H_0^(k), we first run the optimization from X = ZW^T using the max-min estimator with q = k − 1, in which we estimate Ŵ and X̂ = [X̂_(1), ..., X̂_(p)] from the sample data Z. One thing worth mentioning is that X̂ depends on k because the optimization depends on k, although we suppress this in the notation. Next we repeat the following resampling procedure B times: in the bth repetition, we randomly generate independent Gaussian components G^(b) = [G_1^(b), ..., G_{p−k+1}^(b)] with the same number of observations as Z, and construct pseudo components X̃^(b) = [X̂_(1), ..., X̂_(k−1), G^(b)]. Based on the estimated unmixing matrix Ŵ, we use the estimated mixing matrix M̂ = Ŵ^T to construct pseudo observations Z^(b) = X̃^(b) M̂^T. Then we run the optimization from X = Z^(b) W^T using the max-min estimator with q = k − 1, and we estimate Ŵ^(b) and X̂^(b) = [X̂^(b)_(1), ..., X̂^(b)_(p)] from the pseudo data Z^(b). At last, we calculate an approximate p-value by comparing D̂(X̂_(k)) to D̂(X̂^(b)_(k)), or Σ_{j=1}^k D̂(X̂_(j)) to Σ_{j=1}^k D̂(X̂^(b)_(j)), as

p_curr = (1/B) #{ b : D̂(X̂_(k)) ≤ D̂(X̂^(b)_(k)) },
p_cumu = (1/B) #{ b : Σ_{j=1}^k D̂(X̂_(j)) ≤ Σ_{j=1}^k D̂(X̂^(b)_(j)) },    (5)

which we name the current method and the cumulative method, respectively.
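Given the ordered empirical discrepancies of the estimated components from the data and from the B resampled pseudo data sets, the approximate p-values in (5) reduce to simple proportions; a minimal R sketch (our own helper names):

  # Approximate p-values for H_0^(k) from the resampling scheme in (5).
  # d_obs:  length-p vector of ordered empirical discrepancies from the data;
  # d_boot: B x p matrix whose bth row holds the ordered discrepancies from the
  #         bth pseudo data set.
  test_pvalues <- function(d_obs, d_boot, k) {
    c(current    = mean(d_obs[k] <= d_boot[, k]),
      cumulative = mean(sum(d_obs[1:k]) <= rowSums(d_boot[, 1:k, drop = FALSE])))
  }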

Our test shares the resampling technique with Nordhausen et al. (2017). However, there are two major differences. On the one hand, our test does not need to bootstrap on X̂, and thus saves remarkable computational cost, and we will show that it accurately estimates the number of components. On the other hand, our test is more flexible with respect to the test statistic, as the test statistic does not need to match what is used in the objective function of the optimization. The algorithm for our sequential test is summarized in Algorithm 3 below.

Algorithm 3 The algorithm for the sequential test of H_0^(k)
1. Estimate Ŵ from X = ZW^T using the max-min estimator with q = k − 1.
2. Estimate X̂ = ZŴ^T = [X̂_(1), ..., X̂_(p)].
3. Repeat the following procedure for b = 1, ..., B:
   (a) Generate independent Gaussian components G^(b) = [G_1^(b), ..., G_{p−k+1}^(b)].
   (b) Construct X̃^(b) = [X̂_(1), ..., X̂_(k−1), G^(b)].
   (c) Construct Z^(b) = X̃^(b) M̂^T = X̃^(b) Ŵ.
   (d) Estimate Ŵ^(b) from X = Z^(b) W^T using the max-min estimator with q = k − 1.
   (e) Estimate X̂^(b) = Z^(b) (Ŵ^(b))^T = [X̂^(b)_(1), ..., X̂^(b)_(p)].
4. Calculate the p-value using the current or cumulative method in (5).

The proposed procedure involves a sequence of tests, but the number of tests can be dramatically reduced by using a binary search. This approach quickly narrows in on the selected q because we focus on the boundary at which the p-value crosses a specified significance level. As we expect no more than ⌈log_2 p⌉ tests, it makes sense to apply the Bonferroni correction. Note that even for fairly large p, the number of tests remains reasonable, e.g., p = 10,000 implies fewer than fourteen tests. Multiple testing in this setting of sequential testing may become more problematic as the dimension or search space grows, though the sequential search works well in the simulations of Section 5. The issue of multiple testing is an important direction for future research.
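The binary search over k can be written compactly. The sketch below (our own; it assumes a function test_fun(k) returning the p-value of H_0^(k), e.g., via Algorithm 3, and ignores boundary cases) returns the smallest k whose null hypothesis is not rejected, minus one, i.e., the selected number of non-Gaussian components.

  # Binary search for q: find the boundary k where the p-value first exceeds alpha.
  select_q <- function(p_dim, test_fun, alpha) {
    lo <- 1; hi <- p_dim
    while (lo < hi) {
      k <- floor((lo + hi) / 2)
      if (test_fun(k) <= alpha) lo <- k + 1 else hi <- k  # reject H_0^(k) => q >= k
    }
    lo - 1                                                # exactly lo - 1 non-Gaussian components
  }

With p = 125, for example, this requires at most seven evaluations of test_fun, matching the count used for the Bonferroni correction in Section 6.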

5 Simulation Study

5.1 Sub- and Super-Gaussian Densities

In this section, we evaluate the performance of the max-min estimator by performing simulations similar to Matteson and Tsay (2017) for the LNGCA model, and compare it to that of the max estimator using several discrepancy functions, including Skew, Kurt, JB, GPois, and Spline. Moreover, we elaborate on the implementation and the performance measure for the LNGCA model.

We generate the non-Gaussian independent components S ∈ R^{n×q} from 18 distributions using rjordan in the R package ProDenICA (Hastie and Tibshirani, 2010) with sample size n and dimension q. See Figure 1 for the density functions of the 18 distributions. We also generate the Gaussian independent components N ∈ R^{n×(p−q)} with sample size n and dimension p − q. Then X = [S, N] are the underlying components of interest. We simulate a mixing matrix A ∈ R^{p×p} with condition number between 1 and 2 using mixmat in the R package ProDenICA (Hastie and Tibshirani, 2010) and obtain the observations Y = XA^T, which are centered by their sample mean, then pre-whitened by their sample covariance to obtain uncorrelated observations Z = YĤ^T. Finally, we estimate Ŵ_S and M̂_S = Ŵ_S^T based on Z via the max estimator or the max-min estimator. Therefore, Z = XA^T Ĥ^T = X(ĤA)^T, and we evaluate the estimation by comparing the estimated unmixing matrix Ŵ to the ground truth W^0 = (ĤA)^{-1} = A^{-1} Ĥ^{-1} = BĤ^{-1} with respect to S, i.e., comparing Ŵ_S to W_S^0, where W_S^0 = B_S Ĥ^{-1}.
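A sketch of the data-generating step of the simulations, using rjordan and mixmat from the ProDenICA package as described above (the argument forms passed to rjordan and mixmat are assumptions on our part):

  # Simulated LNGCA data in the style of Experiment 1 (q = 2, p = 4, n = 1000).
  library(ProDenICA)
  n <- 1000; q <- 2; p <- 4
  S <- cbind(rjordan("a", n), rjordan("b", n))   # two non-Gaussian signals
  N <- matrix(rnorm(n * (p - q)), n, p - q)      # Gaussian noise components
  X <- scale(cbind(S, N))                        # standardize all components
  A <- mixmat(p)                                 # mixing matrix, condition number in [1, 2]
  Y <- X %*% t(A)                                # observations to be whitened and unmixed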

The optimization problem associated with the max estimator in (3) and the max-min estimator in (4) is non-convex, which requires an initialization step and is sensitive to the initial point. Risk et al. (2014) demonstrated strong sensitivity to the initialization matrix in various ICA algorithms for the eighteen distributions considered in the experiments below. To mitigate the effect of local maxima, we explore two options: one with a single initial point, and another with multiple initial points, where each initial point is generated by orthogonalizing a matrix with random Gaussian elements. We suggest that the number of initial points m grow with the dimension p, e.g., m = p. Each method returns an estimate of the mixing matrix.

To jointly measure the uncertainty associated with both pre-whitening the observations and estimating the non-Gaussian components, we introduce an error measure to evaluate the difference between Ŵ_S and W_S^0 as

min_{Q ∈ P^±} (1/(pq)) || W_S^0 − Q Ŵ_S ||_F^2,

where Q ranges over signed permutation matrices, which is similar to the measures in Ilmonen et al. (2010), Risk et al. (2017), and Miettinen et al. (2017). The minimum above is taken so that the measure is invariant to the sign and order of the components, with respect to the ambiguities associated with the LNGCA model, and the optimal Q is solved by the Hungarian method (Papadimitriou and Steiglitz, 1982).
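The error measure above can be computed by enumerating, for each pair of true and estimated rows, the better of the two signs and then solving the resulting assignment problem; a minimal R sketch (our own helper, using solve_LSAP from the CRAN package clue as the Hungarian-type assignment solver):

  # Error between W0 (q x p, true) and What (q x p, estimated), minimized over
  # signed permutations of the estimated rows, as in Section 5.1.
  library(clue)
  lngca_error <- function(W0, What) {
    q <- nrow(W0); p <- ncol(W0)
    cost <- matrix(0, q, q)
    for (i in 1:q) for (j in 1:q) {
      cost[i, j] <- min(sum((W0[i, ] - What[j, ])^2),  # sign +1
                        sum((W0[i, ] + What[j, ])^2))  # sign -1
    }
    perm <- as.integer(solve_LSAP(cost))               # optimal row matching
    sum(cost[cbind(1:q, perm)]) / (p * q)
  }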

We compare the max-min estimator to the max estimator with various distributions, dimensions of components, and discrepancy functions in Experiments 1 and 2 below.

Experiment 1 (Different distributions of components). We sample S from one of the 18 distributions with q = 2, p = 4, and n = 1000. See Figure 2 for the error measures over 100 trials, with both multiple initial points (m = 4) and a single initial point (m = 1).

For both multiple initial points and a single initial point, the error measure of the max-min estimator is much lower than that of the max estimator for most distributions and discrepancy functions. Therefore, the max-min estimator improves the estimation over the max estimator, no matter whether a single initial point or multiple initial points are used in the optimization. For both the max-min estimator and the max estimator, the error measure with multiple initial points is much lower than that with a single initial point for most of the distributions and discrepancy functions, which illustrates the advantage of using multiple initial points over a single initial point. Moreover, the max-min estimator and multiple initial points turn out to be a powerful combination, since the error measure of the max estimator with multiple initial points can be further reduced by replacing the max estimator with the max-min estimator. The error measure of JB is much lower than that of Skew and Kurt for most of the distributions, which justifies the joint use of moments. In addition, GPois is at least as good as, and often better than, the other discrepancy functions for all the distributions, especially with multiple initial points.

Experiment 2 (Different dimensions of components). We sample S from q randomly selected distributions of the 18 distributions, with q ∈ {2, 4, 8, 16}, p = 2q, and n = 500q. See Figure 3 for the error measures over 100 trials, with both multiple initial points (m = p) and a single initial point (m = 1).

As in the previous experiment, the max-min estimator improves the estimation over the max estimator, and the error measure with multiple initial points is much lower than that with a single initial point in most cases. In addition, GPois performs the best for q = 2, 4, 8, and JB and GPois perform similarly for q = 16 with the max-min estimator and multiple initial points. Since GPois turns out to be more robust to different distributions than Spline in the simulations, and it shares the same idea with Spline, we omit the results of Spline in the following simulation experiments and data examples.

We compare the current method to the cumulative method for selecting q with various sample sizes and discrepancy functions, using the max-min estimator, in Experiment 3 below.

Experiment 3 (Selecting q with varying n). We sample S from q randomly selected distributions of the 18 distributions, with q = 2, p = 4, n ∈ {2000, 4000, 8000}, and B = 200. See Tables 2 and 3 for the empirical size and power over 100 trials, with significance level α = 5%, and both multiple initial points (m = 4) and a single initial point (m = 1).

For both multiple initial points and a single initial point, the empirical power of the current method is much higher than that of the cumulative method, while both methods have empirical size around 5% or even lower, for all the sample sizes and discrepancy functions. Hence, the current method outperforms the cumulative method in testing, no matter whether a single initial point or multiple initial points are used in the optimization. For both the current method and the cumulative method, the empirical size and power with multiple initial points are similar to those with a single initial point, for all the sample sizes and discrepancy functions, which implies no remarkable effect on testing from using multiple initial points or a single initial point in estimation. This suggests that the estimate of the rank of the subspace is less sensitive to initialization than the estimates of the individual components. The empirical power of JB is much higher than that of Skew and Kurt, for all the sample sizes, which justifies the joint use of moments. In addition, GPois outperforms the other discrepancy functions, for all the sample sizes.

5.2 Image Data

Fulfilling a task of unmixing vectorized images similar to Virta et al. (2016), we consider three gray-scale images from the test images of the Computer Vision Group at the University of Granada, depicting a cameraman, a clock, and a leopard, respectively. Each image is represented by a 256 × 256 matrix, where each element indicates the intensity value of a pixel. Three noise images of the same size are simulated with independent standard Gaussian pixels. We standardize the six images such that the intensity values across all the pixels in each image have mean zero and unit variance. Then we vectorize each image into a vector of length 256^2, and combine the vectors from all six images into a matrix X, i.e., p = 6 and n = 256^2.

Thus, each row of X contains the intensity values of a single pixel across all images, and each column of X contains the intensity values of a single image. Then we simulate a mixing matrix A ∈ R^{p×p} using mixmat in the R package ProDenICA (Hastie and Tibshirani, 2010), and mix the six images to obtain the observations Y = XA^T, which are centered by their sample mean, then pre-whitened by their sample covariance to get uncorrelated observations Z = YĤ^T. We aim to infer the number of true images, and then estimate the intensity values in them.

First, we run the sequential test to infer the number of true images q with B = 200. See Table 1 for the p-values corresponding to each k with a single initial point (m = 1). Both the current method and the cumulative method correctly select q = 3 at significance level α = 5%, for all the discrepancy functions. Second, we estimate the intensity values Ŝ with q = 3 and multiple initial points (m = 3). See Figures 4 and 5 for the recovered images Ŝ and the error images Ŝ − S, where the Euclidean norm of the vectorized error image is used to evaluate the accuracy of estimation.

The max-min estimator outperforms the max estimator for Kurt: the max-min estimator recovers the second image, while the second image recovered by the max estimator is masked by noise, and the max-min estimator also has a much lower error than the max estimator on the first recovered image. This illustrates the advantage of the max-min estimator over the max estimator, especially when the max estimator does not perform well. For the other discrepancy functions, both the max-min estimator and the max estimator nicely recover the true images. The estimation with JB is more accurate than that with Skew and Kurt, as its recovered images are mixed with less noise, as indicated by both the estimated images and the error images. In addition, JB and GPois have similar performance, as JB achieves the lowest error on the first image while GPois achieves the lowest error on the second image.

6 EEG Data

There are 24 subjects in the EEG data from the Human Ecology Department at Cornell University, where each subject receives 20 trials. In each trial, 128 EEG channels (3 unused) were collected with 1024 sample points over a few seconds. We study the first trial of the first subject. The data of interest are represented by a 1024 × 125 matrix, i.e., p = 125 and n = 1024. Here, we estimate the number of non-Gaussian signals and examine their time series. Since the max-min estimator and the current method with GPois perform the best in the estimation and testing of the simulations, we only use the max-min estimator and the current method with GPois in this application.

First, we conduct the sequential test to estimate the number of non-Gaussian signals q with B = 200. Using the binary search for p = 125, we expect at most ⌈log_2 125⌉ = 7 tests. Hence, we correct the significance level to α = 5%/7 ≈ 0.714% from the original level of 5%. See Figure 6 for the test statistic values (empirical discrepancies) and critical values at significance levels α ∈ {0.714%, 5%, 10%} (i.e., the 99.286%, 95%, and 90% quantiles of D̂(X̂^(b)_(k))) corresponding to k ∈ {63, 94, 110, 118, 114, 116, 115} chosen by the binary search with a single initial point (m = 1). The current method rejects the null hypothesis that there are exactly 114 non-Gaussian components (p-value < corrected α) and fails to reject the null hypothesis that there are exactly 115 non-Gaussian components (p-value > corrected α), thus selecting q = 115. We also iterate over all k = 1, ..., p and provide the complete testing results for reference. See Figure 7 for the test statistic values and critical values at significance levels α ∈ {0.714%, 5%, 10%} corresponding to each k with a single initial point (m = 1). The dashed lines pinpoint where the test statistic values meet the critical values, indicating that the corresponding component is assumed to be Gaussian because we cannot reject the null hypothesis.
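The number of tests and the corrected level used here follow directly from the binary search; as a quick check in R:

  # Expected number of sequential tests from the binary search and the
  # Bonferroni-corrected significance level for the EEG data (p = 125).
  ceiling(log2(125))   # 7 tests
  0.05 / 7             # about 0.00714, i.e., the 0.714% level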

Second, we estimate the true signals Ŝ with q = 115 and multiple initial points (m = 100). See Figure 8 for the estimated signals Ŝ. The max-min estimator successfully extracts meaningful first and second components, which may be artifacts related to eye blinks in the middle and at the end of the trial. The 115th and 116th components are likely to be Gaussian, as they lie on the boundary where the p-value crosses the corrected level of 0.714%. The 125th (last) component is fairly close to Gaussian, compared to the Gaussian noise we randomly generate with the same sample size as a reference distribution.

7 Conclusion

In this paper, we study the LNGCA model as a generalization of the ICA model, which can have any number of non-Gaussian components and Gaussian components, given that all components are mutually independent. Our contributions are the following.

(1) We propose a new max-min estimator, simultaneously maximizing the discrepancy of each non-Gaussian component from Gaussianity and minimizing the discrepancy of each Gaussian component from Gaussianity. In contrast, the existing max estimator only maximizes the discrepancy of each non-Gaussian component from Gaussianity, as has been used in the ICA model (Hastie and Tibshirani, 2003) and the LNGCA model (Risk et al., 2017). Our approach may seem unintuitive because the individual Gaussian components are not identifiable. However, the Gaussian subspace is identifiable, and joint estimation of the non-Gaussian components and Gaussian components balances the non-Gaussian subspace against the Gaussian subspace. This helps shape the non-Gaussian subspace, and thus improves the accuracy of estimating the non-Gaussian components.

(2) In practice, we need to choose the number of non-Gaussian components. We introduce a sequence of statistical tests based on generating Gaussian components and ordering the estimated components by empirical discrepancy, which is computationally efficient when combined with a binary search to reduce the actual number of tests. Two methods with different test statistics are proposed, where the current method considers the discrepancy value of the component under investigation, while the cumulative method considers the total discrepancy value of all the components from the first one up to the one under investigation.

Although our test shares some characteristics with that of Nordhausen et al. (2017), it has a lower computational burden, since no bootstrap is needed, and it is more flexible in the choice of test statistic.

We evaluate the performance of our methods in simulations, demonstrating that the max-min estimator outperforms the max estimator given the number of non-Gaussian components for different discrepancy functions, dimensions, and distributions of the components, no matter whether a single initial point or multiple initial points are used in the optimization. When the number of non-Gaussian components is unknown, our statistical test successfully finds the correct number for different discrepancy functions and sample sizes, where the current method is more powerful than the cumulative method. In the task of recovering true images from mixed image data, our test determines the correct number of true images, and we illustrate the advantage of the max-min estimator over the max estimator for some discrepancy functions. Specifically, the max-min estimator nicely recovers the images where the max estimator fails using the same discrepancy function, and the estimation error of the max-min estimator is equal to, and sometimes lower than, that of the max estimator. In the task of exploring EEG data, our test finds a large number of non-Gaussian signals, and the estimator extracts two components as the first two non-Gaussian components that may correspond to eye-blink artifacts. The distributions of the estimated signals tend to become more Gaussian as their empirical discrepancy values decrease. There are a large number of non-Gaussian components in this data set. In data applications, applying a preliminary data reduction step using principal component analysis (PCA) would likely remove non-Gaussian signals. This underscores the importance of a flexible estimation and testing procedure.

There are two directions for future research. One is to look for a better way to address the multiple testing issue in searching for a suitable q. Another is to better understand the improvements achieved by the max-min estimator from a theoretical perspective.

Our intuition is that the contributions of the non-Gaussian components to the asymptotic variances would equal zero. Therefore, it would be valuable to gain additional insight into the statistical versus computational advantages of the max-min estimator.

References

F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3(Jul):1-48, 2002.

D. M. Bean. Non-Gaussian component analysis. PhD thesis, University of California, Berkeley, 2014.

A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1995.

G. Blanchard, M. Sugiyama, M. Kawanabe, V. Spokoiny, and K.-R. Müller. Non-Gaussian component analysis: a semi-parametric framework for linear dimension reduction. In Advances in Neural Information Processing Systems, 2006.

J.-F. Cardoso. Source separation using higher order moments. In Acoustics, Speech, and Signal Processing, 1989 International Conference on (ICASSP-89). IEEE, 1989.

J.-F. Cardoso and A. Souloumiac. Blind beamforming for non-Gaussian signals. In IEE Proceedings F (Radar and Signal Processing), volume 140. IET, 1993.

T. Hastie and R. Tibshirani. Independent components analysis through product density estimation. In Advances in Neural Information Processing Systems, 2003.

T. Hastie and R. Tibshirani. ProDenICA: Product density estimation for ICA using tilted Gaussian density estimates. R package version 1.0, 2010.

A. Hyvärinen and E. Oja. A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7), 1997.

A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis, volume 46. John Wiley & Sons, 2004.

P. Ilmonen, K. Nordhausen, H. Oja, and E. Ollila. A new performance index for ICA: properties, computation and asymptotic analysis. In Latent Variable Analysis and Signal Separation, 2010.

C. M. Jarque and A. K. Bera. A test for normality of observations and regression residuals. International Statistical Review / Revue Internationale de Statistique, 1987.

Z. Jin and D. S. Matteson. Independent component analysis via energy-based mutual dependence measures. Under review, 2017.

M. Kawanabe, M. Sugiyama, G. Blanchard, and K.-R. Müller. A new algorithm of non-Gaussian component analysis with radial kernel functions. Annals of the Institute of Statistical Mathematics, 59(1):57-75, 2007.

D. S. Matteson and R. S. Tsay. Independent component analysis via distance covariance. Journal of the American Statistical Association, 112(518), 2017.

J. Miettinen, K. Nordhausen, and S. Taskinen. Blind source separation based on joint diagonalization in R: The packages JADE and BSSasymp. Journal of Statistical Software, 76, 2017.

K. Nordhausen, H. Oja, D. E. Tyler, and J. Virta. Asymptotic and bootstrap tests for the dimension of the non-Gaussian subspace. IEEE Signal Processing Letters, 24(6), 2017.

C. H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, Inc., 1982.

B. B. Risk, D. S. Matteson, D. Ruppert, A. Eloyan, and B. S. Caffo. An evaluation of independent component analyses with an application to resting-state fMRI. Biometrics, 70(1), 2014.

B. B. Risk, D. S. Matteson, and D. Ruppert. Linear non-Gaussian component analysis via maximum likelihood. Journal of the American Statistical Association, to appear, 2017.

H. Sasaki, G. Niu, and M. Sugiyama. Non-Gaussian component analysis with log-density gradient estimation. In Artificial Intelligence and Statistics, 2016.

H. Shiino, H. Sasaki, G. Niu, and M. Sugiyama. Whitening-free least-squares non-Gaussian component analysis. arXiv preprint, 2016.

F. J. Theis, M. Kawanabe, and K.-R. Müller. Uniqueness of non-Gaussianity-based dimension reduction. IEEE Transactions on Signal Processing, 59(9), 2011.

J. Virta, K. Nordhausen, and H. Oja. Joint use of third and fourth cumulants in independent component analysis. arXiv preprint, 2015.

J. Virta, K. Nordhausen, and H. Oja. Projection pursuit for non-Gaussian independent components. arXiv preprint, 2016.

Figure 1: Density plots of the 18 distributions (labeled a through r) from rjordan in the R package ProDenICA.

Table 1: p-values of both the current method and the cumulative method with q = 3, p = 6, n = 256^2, B = 200, α = 5%, and a single initial point (m = 1) in testing for the image data. Rows correspond to the discrepancy functions Skew, Kurt, JB, and GPois, each with the current and cumulative methods; columns correspond to k = 1, ..., 6. (Numerical entries omitted.)

Figure 2: Error measures of both the max estimator and the max-min estimator with q = 2, p = 4, n = 1000, 100 trials, and both multiple initial points (m = 4) and a single initial point (m = 1) in Experiment 1. Panels a through r correspond to the 18 distributions; the four estimation settings shown are single + max, single + maxmin, multi + max, and multi + maxmin.

Figure 3: Error measures of both the max estimator and the max-min estimator with p = 2q, n = 500q, 100 trials, and both multiple initial points (m = p) and a single initial point (m = 1) in Experiment 2. Panels correspond to q = 2, 4, 8, 16; the four estimation settings shown are single + max, single + maxmin, multi + max, and multi + maxmin.

Table 2: Empirical size and power of both the current method and the cumulative method with q = 2, p = 4, B = 200, 100 trials, α = 5%, and a single initial point in Experiment 3. Rows correspond to each sample size n combined with the discrepancy functions Skew, Kurt, JB, and GPois and the current/cumulative methods; columns report the rejection rates at k = 1, ..., 4, grouped into power and size. (Numerical entries omitted.)

Table 3: Empirical size and power of both the current method and the cumulative method with q = 2, p = 4, B = 200, 100 trials, α = 5%, and multiple initial points in Experiment 3. Rows correspond to each sample size n combined with the discrepancy functions Skew, Kurt, JB, and GPois and the current/cumulative methods; columns report the rejection rates at k = 1, ..., 4, grouped into power and size. (Numerical entries omitted.)

Figure 4: Recovered images of both the max estimator and the max-min estimator with q = 3, p = 6, n = 256^2, and multiple initial points (m = 3) in estimation for the image data. Each value in the title is the Euclidean norm of the vectorized error image corresponding to the recovered image. We apply a signed permutation to the images and modify the gray scales for illustration purposes.

Figure 5: Error images of both the max estimator and the max-min estimator with q = 3, p = 6, n = 256^2, and multiple initial points (m = 3) in estimation for the image data. Each value in the title is the Euclidean norm of the vectorized error image. We apply a signed permutation to the images and modify the gray scales for illustration purposes.

Figure 6: Test statistics and critical values of the current method for the values of k visited by the binary search, with p = 125, n = 1024, B = 200, and a single initial point (m = 1) in testing for the EEG data. The plot shows the test statistic together with the 10%, 5%, and 0.714% critical values against k.

Figure 7: Test statistics and critical values of the current method for all k, with p = 125, n = 1024, B = 200, and a single initial point (m = 1) in testing for the EEG data. The panels show the test statistic together with the 10%, 5%, and 0.714% critical values against k.

Figure 8: Estimated signals of the max-min estimator with q = 115, p = 125, n = 1024, and multiple initial points (m = 100) in estimation for the EEG data. The panels show histograms and time series of components 1, 2, 115, 116, and 125, and of randomly generated Gaussian noise, each labeled with its empirical GPois value.


More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

Independent Component Analysis and Unsupervised Learning

Independent Component Analysis and Unsupervised Learning Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien National Cheng Kung University TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent

More information

Non-Euclidean Independent Component Analysis and Oja's Learning

Non-Euclidean Independent Component Analysis and Oja's Learning Non-Euclidean Independent Component Analysis and Oja's Learning M. Lange 1, M. Biehl 2, and T. Villmann 1 1- University of Appl. Sciences Mittweida - Dept. of Mathematics Mittweida, Saxonia - Germany 2-

More information

Massoud BABAIE-ZADEH. Blind Source Separation (BSS) and Independent Componen Analysis (ICA) p.1/39

Massoud BABAIE-ZADEH. Blind Source Separation (BSS) and Independent Componen Analysis (ICA) p.1/39 Blind Source Separation (BSS) and Independent Componen Analysis (ICA) Massoud BABAIE-ZADEH Blind Source Separation (BSS) and Independent Componen Analysis (ICA) p.1/39 Outline Part I Part II Introduction

More information

Deflation-based separation of uncorrelated stationary time series

Deflation-based separation of uncorrelated stationary time series Deflation-based separation of uncorrelated stationary time series Jari Miettinen a,, Klaus Nordhausen b, Hannu Oja c, Sara Taskinen a a Department of Mathematics and Statistics, 40014 University of Jyväskylä,

More information

Independent Component Analysis and Unsupervised Learning. Jen-Tzung Chien

Independent Component Analysis and Unsupervised Learning. Jen-Tzung Chien Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent voices Nonparametric likelihood

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

Independent Component Analysis and Blind Source Separation

Independent Component Analysis and Blind Source Separation Independent Component Analysis and Blind Source Separation Aapo Hyvärinen University of Helsinki and Helsinki Institute of Information Technology 1 Blind source separation Four source signals : 1.5 2 3

More information

Independent Component Analysis of Rock Magnetic Measurements

Independent Component Analysis of Rock Magnetic Measurements Independent Component Analysis of Rock Magnetic Measurements Norbert Marwan March 18, 23 Title Today I will not talk about recurrence plots. Marco and Mamen will talk about them later. Moreover, for the

More information

Separation of the EEG Signal using Improved FastICA Based on Kurtosis Contrast Function

Separation of the EEG Signal using Improved FastICA Based on Kurtosis Contrast Function Australian Journal of Basic and Applied Sciences, 5(9): 2152-2156, 211 ISSN 1991-8178 Separation of the EEG Signal using Improved FastICA Based on Kurtosis Contrast Function 1 Tahir Ahmad, 2 Hjh.Norma

More information

Blind separation of instantaneous mixtures of dependent sources

Blind separation of instantaneous mixtures of dependent sources Blind separation of instantaneous mixtures of dependent sources Marc Castella and Pierre Comon GET/INT, UMR-CNRS 7, 9 rue Charles Fourier, 9 Évry Cedex, France marc.castella@int-evry.fr, CNRS, I3S, UMR

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Introduction Consider a zero mean random vector R n with autocorrelation matri R = E( T ). R has eigenvectors q(1),,q(n) and associated eigenvalues λ(1) λ(n). Let Q = [ q(1)

More information

An Introduction to Independent Components Analysis (ICA)

An Introduction to Independent Components Analysis (ICA) An Introduction to Independent Components Analysis (ICA) Anish R. Shah, CFA Northfield Information Services Anish@northinfo.com Newport Jun 6, 2008 1 Overview of Talk Review principal components Introduce

More information

Artificial Intelligence Module 2. Feature Selection. Andrea Torsello

Artificial Intelligence Module 2. Feature Selection. Andrea Torsello Artificial Intelligence Module 2 Feature Selection Andrea Torsello We have seen that high dimensional data is hard to classify (curse of dimensionality) Often however, the data does not fill all the space

More information

Independent Component (IC) Models: New Extensions of the Multinormal Model

Independent Component (IC) Models: New Extensions of the Multinormal Model Independent Component (IC) Models: New Extensions of the Multinormal Model Davy Paindaveine (joint with Klaus Nordhausen, Hannu Oja, and Sara Taskinen) School of Public Health, ULB, April 2008 My research

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Comparative Analysis of ICA Based Features

Comparative Analysis of ICA Based Features International Journal of Emerging Engineering Research and Technology Volume 2, Issue 7, October 2014, PP 267-273 ISSN 2349-4395 (Print) & ISSN 2349-4409 (Online) Comparative Analysis of ICA Based Features

More information

MULTICHANNEL SIGNAL PROCESSING USING SPATIAL RANK COVARIANCE MATRICES

MULTICHANNEL SIGNAL PROCESSING USING SPATIAL RANK COVARIANCE MATRICES MULTICHANNEL SIGNAL PROCESSING USING SPATIAL RANK COVARIANCE MATRICES S. Visuri 1 H. Oja V. Koivunen 1 1 Signal Processing Lab. Dept. of Statistics Tampere Univ. of Technology University of Jyväskylä P.O.

More information

VARIABLE SELECTION AND INDEPENDENT COMPONENT

VARIABLE SELECTION AND INDEPENDENT COMPONENT VARIABLE SELECTION AND INDEPENDENT COMPONENT ANALYSIS, PLUS TWO ADVERTS Richard Samworth University of Cambridge Joint work with Rajen Shah and Ming Yuan My core research interests A broad range of methodological

More information

One-unit Learning Rules for Independent Component Analysis

One-unit Learning Rules for Independent Component Analysis One-unit Learning Rules for Independent Component Analysis Aapo Hyvarinen and Erkki Oja Helsinki University of Technology Laboratory of Computer and Information Science Rakentajanaukio 2 C, FIN-02150 Espoo,

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 02-01-2018 Biomedical data are usually high-dimensional Number of samples (n) is relatively small whereas number of features (p) can be large Sometimes p>>n Problems

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

A MULTIVARIATE MODEL FOR COMPARISON OF TWO DATASETS AND ITS APPLICATION TO FMRI ANALYSIS

A MULTIVARIATE MODEL FOR COMPARISON OF TWO DATASETS AND ITS APPLICATION TO FMRI ANALYSIS A MULTIVARIATE MODEL FOR COMPARISON OF TWO DATASETS AND ITS APPLICATION TO FMRI ANALYSIS Yi-Ou Li and Tülay Adalı University of Maryland Baltimore County Baltimore, MD Vince D. Calhoun The MIND Institute

More information

A direct formulation for sparse PCA using semidefinite programming

A direct formulation for sparse PCA using semidefinite programming A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley Available online at www.princeton.edu/~aspremon

More information

ICA [6] ICA) [7, 8] ICA ICA ICA [9, 10] J-F. Cardoso. [13] Matlab ICA. Comon[3], Amari & Cardoso[4] ICA ICA

ICA [6] ICA) [7, 8] ICA ICA ICA [9, 10] J-F. Cardoso. [13] Matlab ICA. Comon[3], Amari & Cardoso[4] ICA ICA 16 1 (Independent Component Analysis: ICA) 198 9 ICA ICA ICA 1 ICA 198 Jutten Herault Comon[3], Amari & Cardoso[4] ICA Comon (PCA) projection persuit projection persuit ICA ICA ICA 1 [1] [] ICA ICA EEG

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

Independent Component Analysis and Its Application on Accelerator Physics

Independent Component Analysis and Its Application on Accelerator Physics Independent Component Analysis and Its Application on Accelerator Physics Xiaoying Pang LA-UR-12-20069 ICA and PCA Similarities: Blind source separation method (BSS) no model Observed signals are linear

More information

SPARSE signal representations have gained popularity in recent

SPARSE signal representations have gained popularity in recent 6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying

More information

A two-layer ICA-like model estimated by Score Matching

A two-layer ICA-like model estimated by Score Matching A two-layer ICA-like model estimated by Score Matching Urs Köster and Aapo Hyvärinen University of Helsinki and Helsinki Institute for Information Technology Abstract. Capturing regularities in high-dimensional

More information

ICA. Independent Component Analysis. Zakariás Mátyás

ICA. Independent Component Analysis. Zakariás Mátyás ICA Independent Component Analysis Zakariás Mátyás Contents Definitions Introduction History Algorithms Code Uses of ICA Definitions ICA Miture Separation Signals typical signals Multivariate statistics

More information

FuncICA for time series pattern discovery

FuncICA for time series pattern discovery FuncICA for time series pattern discovery Nishant Mehta and Alexander Gray Georgia Institute of Technology The problem Given a set of inherently continuous time series (e.g. EEG) Find a set of patterns

More information

Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo

Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo Outline in High Dimensions Using the Rodeo Han Liu 1,2 John Lafferty 2,3 Larry Wasserman 1,2 1 Statistics Department, 2 Machine Learning Department, 3 Computer Science Department, Carnegie Mellon University

More information

Combining EMD with ICA to analyze combined EEG-fMRI Data

Combining EMD with ICA to analyze combined EEG-fMRI Data AL-BADDAI, AL-SUBARI, et al.: COMBINED BEEMD-ICA 1 Combining EMD with ICA to analyze combined EEG-fMRI Data Saad M. H. Al-Baddai 1,2 saad.albaddai@yahoo.com arema S. A. Al-Subari 1,2 s.karema@yahoo.com

More information

APPLICATION OF INDEPENDENT COMPONENT ANALYSIS TO CHEMICAL REACTIONS. S.Triadaphillou, A. J. Morris and E. B. Martin

APPLICATION OF INDEPENDENT COMPONENT ANALYSIS TO CHEMICAL REACTIONS. S.Triadaphillou, A. J. Morris and E. B. Martin APPLICAION OF INDEPENDEN COMPONEN ANALYSIS O CHEMICAL REACIONS S.riadaphillou, A. J. Morris and E. B. Martin Centre for Process Analytics and Control echnology School of Chemical Engineering and Advanced

More information

Independent Component Analysis (ICA)

Independent Component Analysis (ICA) Independent Component Analysis (ICA) Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr

More information

Multivariate Distributions

Multivariate Distributions IEOR E4602: Quantitative Risk Management Spring 2016 c 2016 by Martin Haugh Multivariate Distributions We will study multivariate distributions in these notes, focusing 1 in particular on multivariate

More information

Defect Detection using Nonparametric Regression

Defect Detection using Nonparametric Regression Defect Detection using Nonparametric Regression Siana Halim Industrial Engineering Department-Petra Christian University Siwalankerto 121-131 Surabaya- Indonesia halim@petra.ac.id Abstract: To compare

More information

Statistical Analysis of fmrl Data

Statistical Analysis of fmrl Data Statistical Analysis of fmrl Data F. Gregory Ashby The MIT Press Cambridge, Massachusetts London, England Preface xi Acronyms xv 1 Introduction 1 What Is fmri? 2 The Scanning Session 4 Experimental Design

More information

Principal Component Analysis vs. Independent Component Analysis for Damage Detection

Principal Component Analysis vs. Independent Component Analysis for Damage Detection 6th European Workshop on Structural Health Monitoring - Fr..D.4 Principal Component Analysis vs. Independent Component Analysis for Damage Detection D. A. TIBADUIZA, L. E. MUJICA, M. ANAYA, J. RODELLAR

More information

Package steadyica. November 11, 2015

Package steadyica. November 11, 2015 Type Package Package steadyica November 11, 2015 Title ICA and Tests of Independence via Multivariate Distance Covariance Version 1.0 Date 2015-11-08 Author Benjamin B. Risk and Nicholas A. James and David

More information

Different Estimation Methods for the Basic Independent Component Analysis Model

Different Estimation Methods for the Basic Independent Component Analysis Model Washington University in St. Louis Washington University Open Scholarship Arts & Sciences Electronic Theses and Dissertations Arts & Sciences Winter 12-2018 Different Estimation Methods for the Basic Independent

More information

Independent Component Analysis

Independent Component Analysis 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 1 Introduction Indepent

More information

FEATURE EXTRACTION USING SUPERVISED INDEPENDENT COMPONENT ANALYSIS BY MAXIMIZING CLASS DISTANCE

FEATURE EXTRACTION USING SUPERVISED INDEPENDENT COMPONENT ANALYSIS BY MAXIMIZING CLASS DISTANCE FEATURE EXTRACTION USING SUPERVISED INDEPENDENT COMPONENT ANALYSIS BY MAXIMIZING CLASS DISTANCE Yoshinori Sakaguchi*, Seiichi Ozawa*, and Manabu Kotani** *Graduate School of Science and Technology, Kobe

More information

Undercomplete Independent Component. Analysis for Signal Separation and. Dimension Reduction. Category: Algorithms and Architectures.

Undercomplete Independent Component. Analysis for Signal Separation and. Dimension Reduction. Category: Algorithms and Architectures. Undercomplete Independent Component Analysis for Signal Separation and Dimension Reduction John Porrill and James V Stone Psychology Department, Sheeld University, Sheeld, S10 2UR, England. Tel: 0114 222

More information

ANALYSING ICA COMPONENTS BY INJECTING NOISE. August-Bebel-Strasse 89, Potsdam, Germany

ANALYSING ICA COMPONENTS BY INJECTING NOISE. August-Bebel-Strasse 89, Potsdam, Germany ANALYSING ICA COMPONENTS BY INJECTING NOISE Stefan Harmeling, Frank Meinecke, and Klaus-Robert Müller, Fraunhofer FIRST.IDA, Kekuléstrasse, 9 Berlin, Germany University of Potsdam, Department of Computer

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information

On Independent Component Analysis

On Independent Component Analysis On Independent Component Analysis Université libre de Bruxelles European Centre for Advanced Research in Economics and Statistics (ECARES) Solvay Brussels School of Economics and Management Symmetric Outline

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Dependence Minimizing Regression with Model Selection for Non-Linear Causal Inference under Non-Gaussian Noise

Dependence Minimizing Regression with Model Selection for Non-Linear Causal Inference under Non-Gaussian Noise Dependence Minimizing Regression with Model Selection for Non-Linear Causal Inference under Non-Gaussian Noise Makoto Yamada and Masashi Sugiyama Department of Computer Science, Tokyo Institute of Technology

More information

Blind separation of sources that have spatiotemporal variance dependencies

Blind separation of sources that have spatiotemporal variance dependencies Blind separation of sources that have spatiotemporal variance dependencies Aapo Hyvärinen a b Jarmo Hurri a a Neural Networks Research Centre, Helsinki University of Technology, Finland b Helsinki Institute

More information

Robustness of Principal Components

Robustness of Principal Components PCA for Clustering An objective of principal components analysis is to identify linear combinations of the original variables that are useful in accounting for the variation in those original variables.

More information

MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava

MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS Maya Gupta, Luca Cazzanti, and Santosh Srivastava University of Washington Dept. of Electrical Engineering Seattle,

More information

HST.582J/6.555J/16.456J

HST.582J/6.555J/16.456J Blind Source Separation: PCA & ICA HST.582J/6.555J/16.456J Gari D. Clifford gari [at] mit. edu http://www.mit.edu/~gari G. D. Clifford 2005-2009 What is BSS? Assume an observation (signal) is a linear

More information

HST.582J / 6.555J / J Biomedical Signal and Image Processing Spring 2007

HST.582J / 6.555J / J Biomedical Signal and Image Processing Spring 2007 MIT OpenCourseWare http://ocw.mit.edu HST.582J / 6.555J / 16.456J Biomedical Signal and Image Processing Spring 2007 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = 30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can

More information