Independent Component (IC) Models: New Extensions of the Multinormal Model


1 Independent Component (IC) Models: New Extensions of the Multinormal Model Davy Paindaveine (joint with Klaus Nordhausen, Hannu Oja, and Sara Taskinen) School of Public Health, ULB, April 2008

7 My research is in multivariate statistics, where several ($p$, say) measurements are recorded on each of $n$ individuals. We want to come up with models that are potentially useful for a broad range of setups (with $p \ll n$, though). In those models, we develop procedures that are robust to possible model misspecification, robust to possible outlying observations (crucial in the multivariate case!), yet efficient...

8 Outline
1 Introduction: A (too?) simple multivariate problem; Normal and elliptic models
2 ICA: What is it? How does it work? ICA vs PCA
3 IC models: Definition; Inference

10 [Scatter plot: cigarette sales in packs per capita vs. per capita disposable income]

11 $X_i = \begin{pmatrix} X_{i1} \\ X_{i2} \end{pmatrix} = \begin{pmatrix} \text{sales (after $-$ before) for state } i \\ \text{income (after $-$ before) for state } i \end{pmatrix}, \qquad i = 1, \dots, n$

12 Assume one wants to find out, on the basis of the sample $X_1, X_2, \dots, X_n$, whether the tax reform had an effect (or not) on any of the variables. Typically, in statistical terms, this would translate into testing
$H_0: \mu_j = 0$ for all $j$ versus $H_1: \mu_j \neq 0$ for at least one $j$,
at some fixed level $\alpha$ (5%, say).

13 More generally, assume one wants to find out, on the basis of the sample $X_1, X_2, \dots, X_n$, whether the tax reform had some fixed, specified effect on each variable's mean. Typically, in statistical terms, this would translate into testing
$H_0: \mu_j = c_j$ for all $j$ versus $H_1: \mu_j \neq c_j$ for at least one $j$,
at some fixed level $\alpha$ (5%, say).

14 The most basic idea is to go univariate, i.e., for each $j = 1, 2$, to test on the basis of $X_{1j}, \dots, X_{nj}$ whether $H_0^{(j)}: \mu_j = c_j$ holds or not (at level 5%), and to reject $H_0$ as soon as one $H_0^{(j)}$ has been rejected. This is a bad multivariate testing procedure, since it is easy to show that $P[\text{reject } H_0] > 5\%$ under $H_0$. You cannot properly control the level if you act marginally...
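A small numerical illustration of that level inflation (a minimal simulation sketch; the two-variable Gaussian setup and all parameter choices here are ours, not from the talk):

```python
# Acting marginally: two independent N(0,1) marginals, each tested at level
# 5% with a two-sided z-type test; the global H0 is rejected as soon as one
# marginal test rejects.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, n_rep, alpha = 100, 2, 10_000, 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)

rejections = 0
for _ in range(n_rep):
    X = rng.standard_normal((n, p))           # H0 holds: all means equal 0
    T = np.sqrt(n) * X.mean(axis=0) / X.std(axis=0, ddof=1)
    rejections += np.any(np.abs(T) > z_crit)  # reject if ANY marginal rejects

print(rejections / n_rep)   # close to 1 - 0.95**2 = 0.0975, well above 0.05
```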

17 Confidence zones also cannot be built marginally...

19 Hence there is a need for multivariate modelling. The most classical model, the multivariate normal model, specifies that the common density of the $X_i$'s is of the form
$f_X(x) \propto \exp\big(-(x - \mu)' \Sigma^{-1} (x - \mu)/2\big).$
A necessary condition for this to hold is that each of the $p$ variables is normally distributed. Hence, even for $p = 2, 3$, it is extremely unlikely that the underlying distribution is multivariate normal... (you would need to win at Euromillions $p$ times in a row!)

21 Not quite the same model!

23 The marginals are far from Gaussian...

26 Does it hurt? Oh yes, it does...
For $H_0: \mu = \mu_0$, the Gaussian LR test (i) is efficient at the multinormal only, and (ii) is valid only if variances exist (what about financial series?).
For $H_0: \Sigma = \Sigma_0$, the Gaussian LR test is valid only at the multivariate normal distribution!
Remarks: Even for $n = \dots$ Incidentally, those tests are not robust w.r.t. possible outliers.

28 An equivalent definition of the multivariate normal distribution specifies that
$X = A(RU) + \mu,$
where
$U$ is uniformly distributed on the unit sphere in $\mathbb{R}^p$;
$R$, with $R^2 \sim \chi^2_p$, is independent of $U$;
$A$ is a constant $p \times p$ matrix;
$\mu$ is a constant $p$-vector.
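For concreteness, a minimal sketch of sampling from this representation (all names and parameter choices below are illustrative, not from the talk):

```python
# X = A (R U) + mu: U uniform on the unit sphere, R^2 ~ chi^2_p independent
# of U; the result is multivariate normal with mean mu and covariance A A'.
import numpy as np

rng = np.random.default_rng(1)
p, n = 3, 100_000
A = rng.standard_normal((p, p))     # any constant p x p matrix
mu = np.arange(p, dtype=float)      # any constant p-vector

G = rng.standard_normal((n, p))
U = G / np.linalg.norm(G, axis=1, keepdims=True)  # uniform directions
R = np.sqrt(rng.chisquare(df=p, size=(n, 1)))     # radii, R^2 ~ chi^2_p

X = (R * U) @ A.T + mu              # row i is A (R_i U_i) + mu
print(np.round(np.cov(X.T), 2))     # sample covariance...
print(np.round(A @ A.T, 2))         # ...should approximate A A'
```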

32 Elliptical distributions are then obtained by allowing for an arbitrary distribution of $R$ in this representation.
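Continuing the previous sketch, one hedged example: replacing the Gaussian radius with a heavier-tailed one yields a multivariate $t_\nu$ distribution (the choice $\nu = 3$ is ours, for illustration):

```python
# Divide the chi^2_p radius by an independent sqrt(chi^2_nu / nu): the
# resulting X_t is elliptical (multivariate t_nu), with much heavier tails
# but the same "shape" A A' as before. Reuses R, U, A, mu, rng, n above.
nu = 3.0
R_t = R / np.sqrt(rng.chisquare(df=nu, size=(n, 1)) / nu)
X_t = (R_t * U) @ A.T + mu
```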

33 Elliptical distributions add some flexibility (in particular, they allow for heavy tails), but still give rise to
marginals with a common (type of) distribution,
symmetric marginals,
a deep multivariate symmetry structure...
These stylized facts often are sufficient to rule out the assumption of ellipticity... (no need for a test of ellipticity!) I am burning my old records here!

36 "And now for something completely different..." (Monty Python's Flying Circus, 1970)

37 Outline
1 Introduction: A (too?) simple multivariate problem; Normal and elliptic models
2 ICA: What is it? How does it work? ICA vs PCA
3 IC models: Definition; Inference

38 ICA stands for Independent Component Analysis. It is a technique used in Blind Source Separation problems, such as the "cocktail-party problem":
3 conversations: $Z_{it}$ ($i = 1, 2, 3$, $t = 1, \dots, n$)
3 microphones: $X_{it}$
The goal is to recover the original conversations, under the only assumption that the latter are independent.

39 [Plots of the three source signals $Z_{1t}$, $Z_{2t}$, $Z_{3t}$]

42 The basic model is
$X_{1t} = a_{11} Z_{1t} + a_{12} Z_{2t} + a_{13} Z_{3t}$
$X_{2t} = a_{21} Z_{1t} + a_{22} Z_{2t} + a_{23} Z_{3t}$
$X_{3t} = a_{31} Z_{1t} + a_{32} Z_{2t} + a_{33} Z_{3t}$,
that is, $X_t = A Z_t$, where one assumes all $Z_{it}$'s are mutually independent: the "conversations" are independent, and there is no serial dependence. The mixing matrix $A$ does not depend on $t$.
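A toy version of this mixing step (the signals, the mixing matrix, and all names below are illustrative choices of ours):

```python
# Three independent source signals, mixed by a fixed 3 x 3 matrix A:
# each observed channel X_t is a linear combination of the sources Z_t.
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0.0, 1.0, 1000)
Z = np.column_stack([
    np.sign(np.sin(2 * np.pi * 5 * t)),   # square-ish wave
    np.sin(2 * np.pi * 3 * t),            # sinusoid
    rng.uniform(-1.0, 1.0, size=t.size),  # uniform noise
])                                        # columns: Z_1t, Z_2t, Z_3t
A = np.array([[1.0, 0.5, 0.3],
              [0.4, 1.0, 0.6],
              [0.3, 0.7, 1.0]])           # mixing matrix, constant in t
X = Z @ A.T                               # rows: X_t = A Z_t
```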

47 For BW images, $Z_{ij} \in \{0, 1, \dots, 255\}$ represents the grey intensity of the $i$th image at the $j$th pixel (in vectorized form). Here, $n = \dots$ And $Z_{i1} = 61$, $Z_{i2} = 61$, ... Minimal value = 45 (dark grey), maximal value = 255 (white).

48 Can you guess the $Z_1$, $Z_2$, $Z_3$ which generated this mixture $X_1$? [Image: the mixture $X_1$]

50 [Images: the three mixtures $X_1$, $X_2$, $X_3$]

51 ... magic ...

52 [Images: the recovered sources $\hat Z_1$, $\hat Z_2$, $\hat Z_3$]

53 [Images: the original sources $Z_1$, $Z_2$, $Z_3$]

55 Engineers typically estimate $A$ (hence recover the sources $\hat Z_t = \hat A^{-1} X_t$) by choosing the matrix $\hat A$ that makes the marginals of $\hat A^{-1} X_t$ as independent as possible, or as non-Gaussian as possible.
Drawbacks: arbitrary objective functions; computationally intensive procedures; lack of robustness.
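For reference, one standard off-the-shelf instance of this "maximize non-Gaussianity" approach is FastICA; a minimal sketch, assuming scikit-learn is available and reusing the mixed data X from the toy example above:

```python
# FastICA estimates the unmixing by maximizing the non-Gaussianity of the
# recovered components; sources come back only up to order, sign and scale.
from sklearn.decomposition import FastICA

ica = FastICA(n_components=3, random_state=0)
Z_hat = ica.fit_transform(X)   # estimated sources
A_hat = ica.mixing_            # estimated mixing matrix
```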

58 We have our own way to do that. A $p \times p$ scatter matrix $S = S(X_1, \dots, X_n)$ is a statistic such that
$S(AX_1, \dots, AX_n) = A\, S(X_1, \dots, X_n)\, A'$ for all $p \times p$ matrices $A$.
Assume $\lim_n S(X_1, \dots, X_n)$ is diagonal as soon as the common distribution of the $X_i$'s has independent marginals. Then we say $S$ has the independence property.
Examples:
$S_1 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)(X_i - \bar X)'$
$S_2 = \frac{1}{n-1} \sum_{i=1}^n \big[(X_i - \bar X)' S_1^{-1} (X_i - \bar X)\big] (X_i - \bar X)(X_i - \bar X)'$
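These two example scatters are straightforward to compute; a minimal numpy sketch (the function name is ours):

```python
# S1: the usual sample covariance matrix.
# S2: a fourth-moment scatter, weighting each (X_i - Xbar)(X_i - Xbar)'
#     by the squared Mahalanobis distance of X_i.
import numpy as np

def scatter_pair(X):
    n, _ = X.shape
    Xc = X - X.mean(axis=0)
    S1 = Xc.T @ Xc / (n - 1)
    d2 = np.einsum('ij,jk,ik->i', Xc, np.linalg.inv(S1), Xc)
    S2 = (Xc * d2[:, None]).T @ Xc / (n - 1)
    return S1, S2
```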

59 Theorem. Let $S_1$, $S_2$ be scatter matrices with the independence property. Then the $p \times p$ matrix $B_n$, whose columns are the eigenvectors of $S_2^{-1}(X_1, \dots, X_n)\, S_1(X_1, \dots, X_n)$, is consistent for $(A')^{-1}$.
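A minimal sketch of the resulting two-scatter estimation step, reusing scatter_pair and the mixed data X from the sketches above (variable names are ours):

```python
# The columns of B (eigenvectors of S2^{-1} S1) estimate (A')^{-1}, so the
# rows of X @ B recover the sources, up to order, sign and scale.
S1, S2 = scatter_pair(X)
eigvals, B = np.linalg.eig(np.linalg.solve(S2, S1))
B = np.real(B)        # eigenvalues are real here; drop numerical imaginaries
Z_hat = X @ B         # row i approximates Z_i (up to order/sign/scale)
```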

60 Proof. By using the definition of a scatter matrix and the independence property, we obtain
$S_1 = S_1(X_i) = S_1(AZ_i) = A\, S_1(Z_i)\, A' = A D_1 A'$ and $S_2 = S_2(X_i) = S_2(AZ_i) = A\, S_2(Z_i)\, A' = A D_2 A'$,
for some diagonal matrices $D_1$, $D_2$. Hence,
$(S_2^{-1} S_1)(A')^{-1} = (A D_2 A')^{-1}(A D_1 A')(A')^{-1} = (A')^{-1}(D_2^{-1} D_1).$

63 Of course, if we choose robust $S_1$ and $S_2$, the resulting $\hat A$ will be robust as well, which guarantees a robust reconstruction of the independent sources...

64 With robust $S_1$, $S_2$... [Images: the recovered sources $\hat Z_1$, $\hat Z_2$, $\hat Z_3$]

65 With non-robust $S_1$, $S_2$ (the ones given above)... [Images: the recovered sources $\hat Z_1$, $\hat Z_2$, $\hat Z_3$]

67 PCA makes marginals uncorrelated... ICA makes marginals independent... Actually, ICA is going one step further than PCA: ICA = PCA + a rotation... This explains why PCA is often used as a preliminary step to perform ICA.

68 [Pairwise scatter plots: raw data]

69 [Pairwise scatter plots: principal components]

70 [Pairwise scatter plots: independent components]

71 Outline
1 Introduction: A (too?) simple multivariate problem; Normal and elliptic models
2 ICA: What is it? How does it work? ICA vs PCA
3 IC models: Definition; Inference

72 We reject the elliptical model, which states that $X_i = AZ_i + \mu$, where $Z_i = (Z_{i1}, \dots, Z_{ip})'$ is spherically symmetric (about $0 \in \mathbb{R}^p$), in favor of the following:
Definition. The independent component (IC) model states that $X_i = AZ_i + \mu$, where $Z_i = (Z_{i1}, \dots, Z_{ip})'$ has independent marginals (with median 0 and MAD 1).
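A minimal sketch of drawing from such an IC model; the particular marginal distributions (and the use of the raw MAD for standardization) are illustrative assumptions of ours:

```python
# X_i = A Z_i + mu, with independent, non-Gaussian marginals for Z_i,
# each standardized to median 0 and MAD 1 as in the definition.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 500, 3
mu = np.array([1.0, -0.5, 2.0])
A = rng.standard_normal((p, p))

Z = np.column_stack([
    rng.standard_t(df=3, size=n),   # heavy-tailed component
    rng.laplace(size=n),            # double-exponential component
    rng.exponential(size=n),        # asymmetric component
])
Z = (Z - np.median(Z, axis=0)) / stats.median_abs_deviation(Z, axis=0)
X = Z @ A.T + mu                    # rows: X_i = A Z_i + mu
```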

76 As a summary...

77 IC models provide an extension of the multinormal model, which is obtained when all ICs are Gaussian. Both extensions (the elliptic one and the IC one) are disjoint. This IC extension is bigger than that of elliptic models: in IC models, the parameters are $\mu$, $A$, and $p$ densities $g_1, \dots, g_p$; in elliptic models, $\mu$, $A$, and a single density $g$ (that of $\|Z_i\|$). The $g_j$'s allow for much flexibility. In particular,
- we can play with $p$ different kurtosis values...
- the $X_i$ may very well be asymmetric...

79 Inference problem: test $H_0: \mu = 0$ for $n$ i.i.d. observations from the IC model $X_i = AZ_i + \mu$, where $Z_i = (Z_{i1}, \dots, Z_{ip})'$ has independent marginals. The parameters: the location vector $\mu$, the mixing matrix $A$, and the $p$ densities $(g_1, \dots, g_p)$. Of course, we can hardly assume the $g_j$'s to be known, and it is expected that this nuisance will be an important issue.

82 Quite nicely, our estimators $\hat A$ (based on a couple of scatter matrices $S_1$, $S_2$) do not require estimating $\mu$ nor $g_1, \dots, g_p$. We then may (a) write
$Y_i := \hat A^{-1} X_i = \hat A^{-1} A Z_i + \hat A^{-1} \mu \approx Z_i + \hat A^{-1} \mu \qquad (1)$
and (b) go univariate to test componentwise whether the location is 0 (reject $H_0^{(j)}$ for large values of $|T_j|$, with $T_j$ asymptotically $N(0, 1)$ under $H_0^{(j)}$).
Crucial point: we will be able to aggregate those univariate tests easily because the components are independent (reject $H_0$ for large values of $\sum_{j=1}^p T_j^2$, which is asymptotically $\chi^2_p$ under $H_0$).
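A minimal sketch of this two-step scheme, with the Student-type $T_j$ defined on the next slide (function and variable names are ours):

```python
# Componentwise statistics on Y_i = A_hat^{-1} X_i, aggregated into
# sum_j T_j^2 and compared against the chi^2_p distribution.
import numpy as np
from scipy import stats

def ic_location_test(X, A_hat):
    n, p = X.shape
    Y = X @ np.linalg.inv(A_hat).T          # rows: Y_i = A_hat^{-1} X_i
    T = np.sqrt(n) * Y.mean(axis=0) / Y.std(axis=0, ddof=1)
    Q = np.sum(T ** 2)                      # aggregated test statistic
    return Q, stats.chi2.sf(Q, df=p)        # (statistic, asymptotic p-value)
```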

84 Which $T_j$ should we choose? Student:
$T_j = \sqrt{n}\, \bar Y_{\cdot j} / s_{\cdot j} = \frac{1}{\sqrt n} \sum_{i=1}^n Y_{ij} / s_{\cdot j} = \frac{1}{\sqrt n} \sum_{i=1}^n \mathrm{Sign}(Y_{ij})\, |Y_{ij}| / s_{\cdot j}$
This yields a multivariate Student test ($\phi_N$, say), which unfortunately suffers the same drawbacks as classical Gaussian tests:
It cannot deal with heavy tails.
It is poorly robust.

87 Rather than the Student statistic, use
$\tilde T_j := \frac{1}{\sqrt n} \sum_{i=1}^n \mathrm{Sign}(Y_{ij})\, \Phi_+^{-1}\!\Big(\frac{R_{ij}}{n+1}\Big) \;=\; T_j + o_P(1)$ at the multinormal, where
$R_{ij}$ denotes the rank of $|Y_{ij}|$ among $|Y_{1j}|, \dots, |Y_{nj}|$, and $\Phi_+(z) = P[|N(0,1)| \le z]$.
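This Gaussian-score signed-rank statistic is easy to compute; a minimal sketch for one component, using $\Phi_+^{-1}(u) = \Phi^{-1}((u+1)/2)$ (the function name is ours):

```python
# Signed-rank statistic with Gaussian (van der Waerden) scores: signs of the
# observations, ranks of their absolute values, mapped through Phi_+^{-1}.
import numpy as np
from scipy import stats

def vdw_signed_rank(y):
    n = len(y)
    ranks = stats.rankdata(np.abs(y))                   # R_ij
    scores = stats.norm.ppf((ranks / (n + 1) + 1) / 2)  # Phi_+^{-1}(R/(n+1))
    return np.sum(np.sign(y) * scores) / np.sqrt(n)
```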

90 How good is the resulting test ($\tilde\phi_N$, say), which rejects $H_0$ for large values of $\sum_{j=1}^p \tilde T_j^2$?
It is fairly robust to outliers.
It can deal with heavy tails.
And... it is, at the multinormal, as powerful as $\phi_N$! (since $\tilde T_j = T_j + o_P(1)$ at the multinormal)
A natural question: how does it compare with $\phi_N$ (in terms of power) away from the multinormal?

91 The answer is in favor of our rank test:
Theorem. The asymptotic relative efficiency (ARE) of $\tilde\phi_N$ with respect to $\phi_N$ under $\mu = n^{-1/2}\tau$, $A$, and $(g_1, \dots, g_p)$ is of the form
$\mathrm{ARE} = \frac{\sum_{j=1}^p w_j(A, \tau)\, c(g_j)}{\sum_{j=1}^p w_j(A, \tau)}, \qquad w_j(A, \tau) \ge 0.$
Table: various values of $c(g_j)$, for $g_j = t_3, t_6, t_{12}, N, e_2, e_3, e_5$. [Table entries not recovered.]

93 Actually, $c(g_j) \ge 1$ for all $g_j$, which implies that $\tilde\phi_N$ is always (asymptotically) more powerful than the Student test $\phi_N$! Our tests therefore dominate the Student ones both in terms of robustness and efficiency!

94 Remark: rather than "Gaussian scores" as in
$\tilde T_j = \frac{1}{\sqrt n} \sum_{i=1}^n \mathrm{Sign}(Y_{ij})\, \Phi_+^{-1}\!\Big(\frac{R_{ij}}{n+1}\Big),$
one can use (more robust) Wilcoxon scores
$\tilde T_j := \frac{\sqrt 3}{\sqrt n} \sum_{i=1}^n \mathrm{Sign}(Y_{ij})\, \frac{R_{ij}}{n+1}$
or (even more robust) sign scores
$\tilde T_j := \frac{1}{\sqrt n} \sum_{i=1}^n \mathrm{Sign}(Y_{ij}).$
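The two alternative score functions, sketched in the same style as the Gaussian-score statistic above (reusing the same imports; function names are ours):

```python
# Wilcoxon scores: replace Phi_+^{-1}(R/(n+1)) by R/(n+1), rescaled by
# sqrt(3) so the summands have unit variance; sign scores keep only signs.
def wilcoxon_signed_rank(y):
    n = len(y)
    ranks = stats.rankdata(np.abs(y))
    return np.sqrt(3.0 / n) * np.sum(np.sign(y) * ranks / (n + 1))

def sign_statistic(y):
    return np.sum(np.sign(y)) / np.sqrt(len(y))
```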

95 Efficiency is then not as good, as a price for the better robustness...
Table: various values of $c(g_j)$, for $g_j = t_3, t_6, t_{12}, N, e_2, e_3, e_5$, for our Gaussian ($\tilde\phi_N$), Wilcoxon ($\tilde\phi_W$), and sign ($\tilde\phi_S$) tests. [Table entries not recovered.]

96 [Plot: original data]

97 [95% confidence zone: Gaussian method]

98 [95% confidence zone: our $\tilde\phi_N$ IC method]

99 [95% confidence zone: our $\tilde\phi_W$ IC method]

100 [95% confidence zone: our $\tilde\phi_S$ IC method]

101 [Plot: original data]

102 [Plot: contaminated data]

103 [95% confidence zone, contaminated data: Gaussian method]

104 [95% confidence zone, contaminated data: our $\tilde\phi_N$ IC method]

105 [95% confidence zone, contaminated data: our $\tilde\phi_W$ IC method]

106 [95% confidence zone, contaminated data: our $\tilde\phi_S$ IC method]

107 Conclusion: IC models provide quite flexible semiparametric models for multivariate statistics. Rank methods are efficient and robust alternatives to Gaussian methods.

108 References
Oja, H., Sirkiä, S., and Eriksson, J. (2006). Scatter matrices and independent component analysis. Austrian Journal of Statistics 35, 175–189.
Oja, H., Nordhausen, K., and Paindaveine, D. (2007). Signed-rank tests for location in the symmetric independent component model. ECORE DP 2007/123. Submitted.
Oja, H., Paindaveine, D., and Taskinen, S. (2008). Parametric and nonparametric tests for multivariate independence in IC models. Manuscript in preparation.
