Biomedical signal processing application of optimization methods for machine learning problems


1 Biomedical signal processing: application of optimization methods for machine learning problems. Fabian J. Theis, Computational Modeling in Biology, Institute of Bioinformatics and Systems Biology, Helmholtz Zentrum München. Grenoble, 16-Sep-2008

2 Data mining: the cocktail-party problem (figures: several speakers recorded by several microphones; a neural network W is used to recover the individual signals from the mixtures)

6 Data mining: mixture model x(t) = f(s(t)); estimate the mixing process f and the sources s(t); often f is linear, x(t) = A s(t); a neural network W produces source estimates ŝ(t) from the mixtures x(t)

7 Outline: 1. Supervised methods (Motivation 1: classification; Motivation 2: image segmentation; statistical decision theory); 2. Clustering (k-means; partitional clustering); 3. Independent component analysis (sparse component analysis; nonlinear sparse component analysis)

8 Section 1: Supervised methods (Motivation 1: classification; Motivation 2: image segmentation; statistical decision theory)

9 Motivation 1: classification. Data analysis task: decide between two (or multiple) classes, s(t) ∈ {0, 1}; learn by example from labelled training data

10 Neural networks (figure)

11 Classification: example. Observations: an immunological data set, 3 cell parameters of 37 children with pulmonary diseases. Goal: interpretation using supervised and unsupervised analysis; disease classification into chronic bronchitis (CB) or interstitial lung disease (ILD). Cooperation with D. Hartl, Pediatric Immunology, Munich

13 Data visualization & dimension reduction: parameter interpretation?

14 Data visualization & dimension reduction (figure: self-organizing map component planes with CB/ILD cluster labels): visualization by a self-organizing map network; topology-preserving nonlinear dimension reduction/scaling; detect new parameter dependencies

15 Disease classification. Dimension-reducing network z(i) = B_supervised A_unsup. x(i). Results: down-scaling to 5 hidden neurons suffices; classification rate of > 90%. [Theis, Hartl, Krauss-Etschmann, Lang. Neural network signal analysis in immunology. Proc. ISSPA 2003]

17 Motivation 2: image segmentation. A classification application in image processing: object classification

18 Problem: how many labelled cells lie in this section image? (figure)

19 Biological background: adult neurogenesis. New neurons emerge even in the adult human brain; the level depends on external stimuli. Are there neural ancestral cells? Goal: automated quantification of neurogenesis in adult mice. Cooperation with Z. Kohl, Department of Neurology, University of Regensburg

20 Automated cell counting (figure)

21 Automated cell counting with a directional neural network: train a cell-patch classifier ζ using a directional neural network; scan the image using ζ to get the cell positions; speed-up via hierarchical and multiscale methods

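To make the scanning step concrete, here is a minimal Python sketch (not the original ZANE implementation): it assumes a hypothetical patch classifier classify_patch that returns a cell probability for a square image patch, slides it over the section image and keeps thresholded local maxima as candidate cell positions.

    import numpy as np

    def scan_image(image, classify_patch, patch=15, stride=2, thresh=0.9):
        """Slide a trained patch classifier over the image and return candidate cell positions."""
        h, w = image.shape
        half = patch // 2
        score = np.zeros((h, w))
        for r in range(half, h - half, stride):
            for c in range(half, w - half, stride):
                window = image[r - half:r + half + 1, c - half:c + half + 1]
                score[r, c] = classify_patch(window)      # hypothetical classifier (e.g. an MLP)
        positions = []                                    # crude non-maximum suppression
        for r in range(half, h - half):
            for c in range(half, w - half):
                if score[r, c] > thresh and score[r, c] == score[r-2:r+3, c-2:c+3].max():
                    positions.append((r, c))
        return positions
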
23 Results: comparison with 2 experts (inter-expert variability ±5%) yields 90% ± 4% counting accuracy. Application: considerable cell proliferation in the hippocampus of epileptic mice. [Theis, Kohl, Guggenberger, Kuhn, Lang. ZANE - an algorithm for counting labelled cells in section images. Proc. MEDSIP 2004]

24 Statistical decision theory: setup. Input: random vector X : Ω → R^p. Output: random variable Y : Ω → R, or a categorical output, possibly Y ∈ {0, 1}. The input-output relation is measured by the joint density P(X, Y); it is realized by samples (training data) (x_i, y_i) for i = 1, ..., N, often collected in an (N × p)-matrix X and a vector y ∈ R^N

25 Goal: prediction. Learn a classifier from the training data and predict y for a new sample x

26 Linear model: ŷ = β̂_0 + Σ_{j=1}^p x_j β̂_j; setting x_0 := 1 gives ŷ = xᵀβ̂. Least squares: minimize RSS(β) = Σ_{i=1}^N (y_i − x_iᵀβ)² = (y − Xβ)ᵀ(y − Xβ); setting the gradient Xᵀ(y − Xβ) to zero yields β̂ = (XᵀX)⁻¹ Xᵀ y

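As a quick numerical illustration of the closed form (a sketch with synthetic data, not part of the original slides): the normal equations can be solved directly, although a QR/SVD-based routine such as numpy.linalg.lstsq is numerically preferable to forming (XᵀX)⁻¹.

    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 100, 3
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])   # first column x_0 := 1
    beta_true = np.array([0.5, 1.0, -2.0, 0.3])                 # illustrative coefficients
    y = X @ beta_true + 0.1 * rng.normal(size=N)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)                # normal equations
    beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)          # numerically safer equivalent
    print(beta_hat, beta_lstsq)
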
30 Linear model: decision boundary {x | xᵀβ̂ = 1/2}

31 Linear model: nice, but what about more complex data? (r = 2 and r = 1 Gaussians per class, σ = 0.2, with the r means sampled from N((1, 0), I) and N((0, 1), I), respectively)

32 Linear model: hm? The global, linear model is too rigid

33 Nearest-neighbour method: ŷ(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i, where N_k(x) denotes the k closest points x_i to x. A local model; needs a metric (here Euclidean). How to determine k? Smaller k: higher learning accuracy; larger k: smoother fit, higher generalizability; least-squares learning on the training error would yield k = 1

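A minimal sketch of the k-nearest-neighbour estimate from the slide above, assuming the training data are given as an array Xtrain with labels ytrain (names and k are illustrative only).

    import numpy as np

    def knn_predict(Xtrain, ytrain, x, k=15):
        """yhat(x) = average of y_i over the k nearest neighbours N_k(x) (Euclidean metric)."""
        dist = np.linalg.norm(Xtrain - x, axis=1)
        neighbours = np.argsort(dist)[:k]
        return ytrain[neighbours].mean()

    # classify: label = 1 if knn_predict(Xtrain, ytrain, xnew) > 0.5 else 0
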
35 Nearest-neighbour method: decision boundary {x | ŷ(x) = 1/2}

36 Nearest-neighbour method, k = 1, 2, ... (figures: decision boundaries for varying k)

37 Statistical decisions. Probabilistic view: P(X, Y) = P(Y | X) P(X). Find a function f(X) predicting Y as well as possible w.r.t. the squared error loss L(Y, f(X)) = (Y − f(X))². Expected prediction error: EPE(f) = E(Y − f(X))² = ∫ (y − f(x))² P(dx, dy) = E_X E_{Y|X}((Y − f(X))² | X). Pointwise minimization suffices: f(x) = argmin_c E_{Y|X}((Y − c)² | X = x), solved by the conditional expectation (regression function) f(x) = E(Y | X = x)

39 Statistical decisions. f(x) = E(Y | X = x) can be estimated by f̂(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i: approximate the expectation via sample averages, and approximate conditioning at a point by conditioning on a local neighbourhood. Note: f̂(x) → E(Y | X = x) for N, k → ∞ with k/N → 0. But with (very) finite samples we face the curse of dimensionality: the fraction r of the unit cube in p dimensions is covered by a subcube of edge length e_p(r) = r^{1/p}; e.g. e_2(0.01) = 0.1, e_2(0.1) ≈ 0.32, e_10(0.01) ≈ 0.63, e_10(0.1) ≈ 0.8

42 Statistical decisions. If, instead of approximating f(x) = E(Y | X = x) locally, we assume the linear model f(x) = xᵀβ, we get β = E(XXᵀ)⁻¹ E(XY): no conditioning, a global approximation

44 Statistical decisions for discrete Y. If Y ∈ {0, 1}, consider the 0-1 loss L(Y, f(X)) = 0 if f(X) = Y and 1 otherwise. Then EPE = E_X Σ_{y ∈ {0,1}} L(y, f(X)) P(y | X), and hence Ŷ(x) = argmin_{y' ∈ {0,1}} Σ_{y ∈ {0,1}} L(y, y') P(y | X = x) = argmin_{y' ∈ {0,1}} (1 − P(y' | X = x)), which yields the Bayes classifier Ŷ(x) = argmax_y P(y | X = x). Question: how to model P(Y | X)?

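A small sketch of the Bayes classifier for the two-class case, assuming (unrealistically) that the class-conditional densities and priors are known; modelling P(Y | X) from data is exactly the open question on the slide. The Gaussian parameters below are illustrative, not values from the talk.

    import numpy as np
    from scipy.stats import multivariate_normal

    # assumed (known) generative model: one Gaussian per class, equal priors
    densities = {0: multivariate_normal(mean=[1.0, 0.0], cov=0.2 * np.eye(2)),
                 1: multivariate_normal(mean=[0.0, 1.0], cov=0.2 * np.eye(2))}
    prior = {0: 0.5, 1: 0.5}

    def bayes_classify(x):
        """Yhat(x) = argmax_y P(y | X = x), i.e. argmax_y p(x | y) P(y)."""
        posterior = {y: densities[y].pdf(x) * prior[y] for y in (0, 1)}
        return max(posterior, key=posterior.get)
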
47 Bayes classifier: results (figure)

48 Method combinations. Nonlinear models, e.g. additive models f(x) = Σ_{j=1}^p f_j(x_j), or basis expansions f(x) = Σ_j h_j(x) β_j with polynomial, Fourier or sigmoidal bases (neural networks); prediction/function approximation by maximum-likelihood estimation of the parameters; enhance generalizability by adding a regularization term +λ J(f) to RSS(f), for f from some function class; generalize inner-product methods to nonlinear situations via a high-dimensional embedding x ↦ Φ(x) and kernels k(x, x') = Φ(x)ᵀΦ(x')

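A sketch combining two of these ingredients for scalar inputs: a polynomial basis expansion h_j(x) = x^j and a quadratic regularization term λ‖β‖² added to RSS (ridge regression in the expanded basis). Degree and λ are illustrative choices, not values from the talk.

    import numpy as np

    def poly_basis(x, degree=5):
        """Basis expansion h_j(x) = x^j, j = 0..degree, for scalar inputs."""
        return np.vander(np.asarray(x, dtype=float), degree + 1, increasing=True)

    def fit_ridge_basis(x, y, degree=5, lam=1e-2):
        """Minimize RSS(beta) + lam * ||beta||^2 in the expanded basis."""
        H = poly_basis(x, degree)
        return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)

    def predict(beta, x, degree=5):
        return poly_basis(x, degree) @ beta
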
49 Section 2: Clustering (k-means; partitional clustering)

50 Clustering, explained by example. Goal: differentiate hand-written digits 2 and 4; given a set of unlabelled gray-scale images of 2s and 4s, find the subset of 2s and the subset of 4s (like a baby would): unsupervised learning by example

53 Example data set. Here: machine learning, i.e. a statistical approach, needs many test cases; a set of 28x28 gray-scale images of each digit; interpret each 28x28 image as an element of R^784; dimension reduction via PCA to only 2 dimensions

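A minimal sketch of the preprocessing described above, assuming the digit images are given as an array of shape (N, 28, 28): flatten each image into R^784, centre the data and project onto the first two principal components.

    import numpy as np

    def pca_2d(images):
        """Project (N, 28, 28) images onto their first two principal components."""
        X = images.reshape(len(images), -1).astype(float)   # each image as an element of R^784
        Xc = X - X.mean(axis=0)                             # centre the data
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # right singular vectors = PCs
        return Xc @ Vt[:2].T                                # (N, 2) coordinates for plotting
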
58 k-means clustering: data vectors (samples) x(1), x(2), ..., x(T) ∈ R^n and a distance measure d(x, y) between samples. Algorithm (k-means): given the number k of clusters, initialize the centroids randomly; update rules: batch or sequential (online). Cost function: minimize E(c_i, C_i) := Σ_{i=1}^k (1/|C_i|) Σ_{x ∈ C_i} d(x, c_i)². (Figures illustrate the batch update: partitioning, assignment, centroid update; and the sequential update: pick an arbitrary sample, find its nearest centroid, update that centroid.) [Theis, Gruber. Grassmann clustering. Proc. EUSIPCO 2006]

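A compact batch k-means sketch for the Euclidean case (illustrative only, not the Grassmann variant of the cited paper): it alternates the assignment and centroid-update steps described above until the centroids stop moving.

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        """Batch k-means with Euclidean distance: alternate assignment and centroid update."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initialization
        for _ in range(iters):
            # assignment step: nearest centroid for every sample
            d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # update step: centroid = mean of its cluster (Euclidean case, cf. slide 80)
            new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
                            for i in range(k)])
            if np.allclose(new, centroids):
                break
            centroids = new
        return centroids, labels
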
68 Batch k-means on the digit data (figures after 1 to 7 iterations); the algorithm converges with a final error of 4.5%

75 Partitional clustering. Goal: given a set A = {a_1, ..., a_T} of points in a metric space (M, d), find a partition of A into clusters B_i with ∪_i B_i = A, and centroids c_i ∈ M, minimizing E(B_1, c_1, ..., B_k, c_k) := Σ_{i=1}^k Σ_{a ∈ B_i} d(a, c_i)². (1) Equivalently, a constrained nonlinear optimization problem: minimize E(W, C) := Σ_{i=1}^k Σ_{t=1}^T w_it d(a_t, c_i)² (2) subject to w_it ∈ {0, 1} and Σ_{i=1}^k w_it = 1 for 1 ≤ i ≤ k, 1 ≤ t ≤ T, (3) where C := {c_1, ..., c_k} are the centroid locations and W := (w_it) is the partition matrix

77 Minimize this! Common approach: partial optimization for W and C, i.e. alternate minimization of W and C while keeping the other fixed. Batch k-means algorithm: initial random choice of centroids c_1, ..., c_k; iterate until convergence: cluster assignment: for each a_t determine an index i(t) = argmin_i d(a_t, c_i); cluster update: within each cluster B_i := {a_t | i(t) = i}, determine the centroid c_i := argmin_c Σ_{a ∈ B_i} d(a, c)². Convergence to a local minimum (not necessarily the global one)

80 Euclidean case. Special case: M := R^n with the Euclidean distance d(x, y) := ‖x − y‖. The centroids can then be calculated in closed form: the centroid is the cluster mean c_i := (1/|B_i|) Σ_{a ∈ B_i} a. This follows directly from Σ_{a ∈ B_i} ‖a − c_i‖² = Σ_{a ∈ B_i} Σ_{j=1}^n (a_j − c_ij)² = Σ_{j=1}^n Σ_{a ∈ B_i} (a_j² − 2 a_j c_ij + c_ij²), which is minimized coordinate-wise by the mean

83 Extensions: c_i := argmin_c Σ_{a ∈ B_i} d(a, c)^p leads to more difficult optimization problems: non-Euclidean spaces, e.g. RP^n or Grassmann manifolds; extensions from p = 2 to e.g. p = 1 or other exponents p; p = 1 corresponds to finding the spatial median of B_i

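For p = 1 the centroid update has no closed form; a standard choice is Weiszfeld's iteration for the spatial median, sketched below (the guard eps sidesteps the degenerate case where an iterate coincides with a data point).

    import numpy as np

    def spatial_median(A, iters=100, eps=1e-9):
        """Weiszfeld iteration for argmin_c sum_{a in A} ||a - c|| (rows of A are the points)."""
        c = A.mean(axis=0)                        # start from the ordinary mean
        for _ in range(iters):
            d = np.linalg.norm(A - c, axis=1)
            w = 1.0 / np.maximum(d, eps)          # inverse-distance weights
            c_new = (A * w[:, None]).sum(axis=0) / w.sum()
            if np.linalg.norm(c_new - c) < 1e-10:
                break
            c = c_new
        return c
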
85 Section 3: Independent component analysis (sparse component analysis; nonlinear sparse component analysis)

86 Independent component analysis. Example: the cocktail-party problem of the brain (auditory cortex, word detection, decision). [Keck, Theis, Gruber, Lang, Specht, Puntonet. 3D spatial analysis of fMRI data on a word perception task. LNCS 3195]

87 BSS model. Blind source separation (BSS) problem: x(t) = A s(t) + ε(t), where x(t) is the observed m-dimensional random vector, A is an (unknown) full-rank m × n matrix, s(t) are the (unknown) n-dimensional source signals (here n ≤ m), and ε(t) is (unknown) white noise. Goal: given x, recover A and s! Additional assumptions are necessary: stochastically independent s(t), p_s(s_1, ..., s_n) = p_{s_1}(s_1) ··· p_{s_n}(s_n): independent component analysis (ICA); sparse source signals s_i(t): sparse component analysis (SCA); nonnegative s and A: nonnegative matrix factorization (NMF)

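A sketch of the generative BSS model on this slide with toy sources; the source distributions, the mixing matrix and the noise level are illustrative only, and x is the only quantity a BSS algorithm gets to see.

    import numpy as np

    rng = np.random.default_rng(0)
    T = 10000
    s = np.vstack([rng.laplace(size=T),            # two independent, non-Gaussian sources
                   rng.uniform(-1, 1, size=T)])
    A = np.array([[1.0, 0.6],
                  [0.4, 1.0]])                     # unknown full-rank mixing matrix (m = n = 2)
    eps = 0.01 * rng.normal(size=(2, T))           # white noise
    x = A @ s + eps                                # the mixtures: the only observed quantity
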
89 Important questions in data analysis: the model? (restrictions on A and s); the indeterminacies of the model? algorithmic identification given x? Identifiability: the obvious indeterminacies are scaling L and permutation P. Theorem: let the independent random vector s ∈ L² contain at most one Gaussian component; given two ICA solutions As = Ãs̃, then Ã = ALP. Note: the theorem does not hold for Gaussian sources s. [Theis. A new concept for separability problems in blind source separation. Neural Computation, 2004]

93 ICA algorithms. Basic scheme (case m = n): search for an invertible demixing matrix W that minimizes some dependence measure of Wx. Some contrasts: minimize the mutual information I(Wx); maximize the neural-network output entropy H(f(Wx)); extend PCA by performing nonlinear decorrelation; maximize the non-Gaussianity of the output components (Wx)_i; minimize the off-diagonal error of the Hessian of ln p_{Wx}; minimize the median deviation of Wx. [Theis et al. Linear geometric ICA: fundamentals and algorithms. Neural Computation, 2003] [Theis, Lang, Puntonet. A geometric algorithm for overcomplete linear ICA. Neurocomputing, 2004]

95 Optimization problem: minimize a cost function f(W) on Gl(n) or O(n). Often: gradient descent W ← W − η ∇f(W); in high dimensions: simulated annealing or genetic algorithms. Use the non-Euclidean structure of Gl(n): the Euclidean gradient is not compatible with the group Gl(n); define the natural gradient ∇_nat f(W) = ∇_euc f(W) WᵀW: considerable performance increase. [Stadlthanner, Theis, Puntonet, Lang. Extended sparse nonnegative matrix factorization. LNCS 3512] [Squartini, Theis. New Riemannian metrics for speeding-up the convergence of over- and underdetermined ICA. In preparation] [Theis. Gradients on matrix manifolds and their chain rule. Submitted, 2005]

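A sketch of one classical natural-gradient ICA update (maximum-likelihood/Infomax style with a tanh score, which implicitly assumes super-Gaussian sources and centred mixtures); multiplying the Euclidean gradient by WᵀW turns the update into a simple multiplicative rule without matrix inversion. Learning rate and iteration count are illustrative.

    import numpy as np

    def natural_gradient_ica(x, iters=200, eta=0.1):
        """x: (n, T) centred mixtures; returns a demixing matrix W, up to scaling and permutation."""
        n, T = x.shape
        W = np.eye(n)
        for _ in range(iters):
            y = W @ x
            # natural-gradient update dW = eta * (I - E[tanh(y) y^T]) W
            W = W + eta * (np.eye(n) - (np.tanh(y) @ y.T) / T) @ W
        return W
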
97 fMRI analysis (spatial-only BSS). Functional magnetic resonance imaging: a noninvasive brain-imaging technique; provides information on brain activation patterns; activation maps help identify task-related brain regions. BSS techniques are applicable to fMRI

99 Experimental setup. Experiment: block-design protocol with 5 time instants of visual stimulation alternating with 5 instants of rest; 100 scans taking 3 s each. Data set: well-known design with expected activity in the visual cortex; here only a single horizontal slice is used. Preprocessing: motion correction, smoothing. Data acquired by D. Auer, MPI of Psychiatry, Munich

100 Results (figures: (a) spatial sources s^S, (b) temporal sources t^S, with per-component crosscorrelations cc): component 2 partially represents the frontal eye fields; component 4 is the stimulus component, with cc = 0.9 against the stimulus time course. [Theis, Gruber, Keck, Lang. Functional MRI analysis by a novel spatiotemporal ICA algorithm. LNCS 3696]

102 Why extend ICA? The identifiability of ICA only holds if the data follows the generative model with independent sources. Simulation: apply ICA to data not fulfilling the ICA model; here the sources consist of a 2-d and a 1-d irreducible component; plot the Amari (crosstalking) error over 100 runs for FastICA, JADE and Extended Infomax. Result: no recovery of the mixing matrix

104 Require stochastic independence only between groups of source components: an nk-dimensional S is k-independent if the k-dimensional subvectors (S_1, ..., S_k), ..., (S_{nk−k+1}, ..., S_{nk}) are mutually independent: independent subspace analysis (ISA). Recent result: extension to arbitrary group sizes; major advantage: a general independent subspace analysis (ISA) always exists. [Theis. Uniqueness of complex and multidimensional independent component analysis. Signal Processing, 2004]

106 Schematic comparison (figures): PCA, ICA, ISA with fixed group size, and general ISA all factorize X = AS; the ICA and ISA solutions are unique only up to scaling L and permutation P (block-wise for ISA)

110 ISA framework. Definition: Y is an independent component of X if X = A(Y, Z) such that Y and Z are stochastically independent. Definition (general ISA): S is irreducible if it contains no lower-dimensional independent component; W ∈ Gl(n) is an independent subspace analysis of X if WX = (S_1, ..., S_k) with pairwise independent, irreducible S_i. Theorem: given a random vector X with existing covariance, an ISA of X exists and is unique except for scaling and permutation

113 Algebraic ISA algorithms. Main idea: source condition matrices C_i(S) are block-diagonal. Subspace JADE: after whitening, assume an orthogonal A; group-independence of S implies that the contracted quadricovariance matrices C_ij(S) = Aᵀ C_ij(X) A are block-diagonal; perform joint block diagonalization of {C_ij(X)} to recover A; for general ISA, estimate the block structure after diagonalization. [Theis. Towards a general independent subspace analysis. Proc. NIPS 2006]

115 Joint block diagonalization with unknown block sizes. Joint block diagonalization (JBD): given n × n matrices C_1, ..., C_K and a partition m_1 + ... + m_r = n of n, find an orthogonal A such that for all k, Aᵀ C_k A is m-block-diagonal. Minimize (e.g. by applying iterative Givens rotations) f_m(Â) := Σ_{k=1}^K ‖Âᵀ C_k Â − diag_m(Âᵀ C_k Â)‖²_F. Unknown block size m: general JBD then searches for the partition of maximal length admitting an exact block diagonalizer, i.e. (A, m) = argmax_{m : ∃A with f_m(A) = 0} |m|. Result (JBD by JD): any block-optimal JBD, i.e. a zero of f_m, is a local minimum of ordinary joint diagonalization

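A sketch of the block-diagonality criterion f_m only (the Givens-rotation minimization itself is not shown): given a candidate Â, the matrices C_k and a block partition m, it sums the squared Frobenius norm of everything outside the diagonal blocks of Âᵀ C_k Â.

    import numpy as np

    def block_mask(m):
        """Boolean mask that is True inside the diagonal blocks of sizes m = (m_1, ..., m_r)."""
        n = sum(m)
        mask = np.zeros((n, n), dtype=bool)
        start = 0
        for size in m:
            mask[start:start + size, start:start + size] = True
            start += size
        return mask

    def jbd_cost(A_hat, Cs, m):
        """f_m(A_hat) = sum_k || off-block part of A_hat^T C_k A_hat ||_F^2."""
        off = ~block_mask(m)
        return sum(np.linalg.norm((A_hat.T @ C @ A_hat)[off]) ** 2 for C in Cs)
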
119 Example: performance of the proposed general JBD on an (unknown) block partition of n = 4, with additive noise at an SNR of 5 dB and a set of K matrices C_k (figures: ÂᵀA without and with permutation recovery). Result: the estimate Â equals A after permutation recovery

120 Extraction of fetal electrocardiograms: separate the fetal ECG (FECG) from the mother's ECG (MECG) in the recordings; apply Hessian-based MICA with k = 2 and 5 Hessians

121 Figures: (a) ECG recordings, (b) extracted sources, (c) MECG part, (d) FECG part

122 Sparse component analysis. [Theis, Puntonet, Lang. Median-based clustering for underdetermined blind signal processing. IEEE SPL, 2005]

123 Model. Sparse component analysis (SCA) problem: x(t) = A s(t), with observed mixtures x(t) ∈ R^m, A an (unknown) real matrix with linearly independent columns, and s(t) ∈ R^n the (unknown) (m−1)-sparse sources, i.e. s(t) has at most m−1 non-zeros. Goal: recover the unknown A and s(t) given only x(t). Theorem: if s(t) is (m−1)-sparse and A and s(t) are in general position, both A and s(t) are identifiable (except for scaling and permutation). [Georgiev, Theis, Cichocki. Sparse component analysis and blind source separation of underdetermined mixtures. IEEE TNN, 2005]

125 SCA algorithm: matrix identification by multiple hyperplane detection, e.g. using a generalized Hough transform, robust against outliers and noise; source recovery using sparsity and the identified matrix. [Theis, Georgiev, Cichocki. Robust sparse component analysis based on a generalized Hough transform. Signal Processing, 2006]

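A sketch of the source-recovery step for the simplest case m = 2, where each s(t) is at most 1-sparse: with the (estimated) mixing matrix given, every observation x(t) is assigned to the closest data line, i.e. the best-matching column of A, and the corresponding coefficient is read off. The hyperplane/Hough identification step is not reproduced here.

    import numpy as np

    def recover_1sparse(x, A):
        """x: (2, T) mixtures, A: (2, n) estimated mixing matrix. Returns (n, T) 1-sparse sources."""
        A = A / np.linalg.norm(A, axis=0)            # normalize columns (scaling indeterminacy)
        n, T = A.shape[1], x.shape[1]
        s = np.zeros((n, T))
        coeff = A.T @ x                              # projection of each x(t) onto every column
        resid = np.linalg.norm(x[:, None, :] - A[:, :, None] * coeff[None, :, :], axis=0)
        best = resid.argmin(axis=0)                  # column (data line) closest to x(t)
        s[best, np.arange(T)] = coeff[best, np.arange(T)]
        return s
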
126 SCA of surface electromyograms. Electromyogram (EMG): the electric signal generated by a contracting muscle; surface EMG: non-invasive, but the sources overlap. Cooperation with G. García, Bioinformatic Engineering, Osaka

127 Results: source and SCA recovery within 8 artificial, dependent mixtures. Results on toy data: sparseness works as a separation criterion. Real data: relative sEMG enhancement of 24.6 ± 21.4% (mean over 9 subjects), beating standard signal processing and ICA. [Theis, García. On the use of sparse signal decomposition in the analysis of multi-channel surface EMGs. Signal Processing, 2006]

129 SCA of functional MRI data (figures: component maps S and time courses A with per-component crosscorrelations cc). The complete SCA was performed using k-means hyperplane clustering; components 2 and 3 represent the inner ventricles, component 1 contains the frontal eye fields; component 5 is the desired visual stimulus component, active in the visual cortex (crosscorrelation with the stimulus cc = 0.88; FastICA yields a similar cc = 0.9)

131 Postnonlinear SCA. Given an m-dimensional random vector x, find a representation x = f(As) with an unknown n-dimensional random vector s (sources), an m × n matrix A (mixing matrix), and a diagonal invertible f = f_1 × ... × f_m (postnonlinearities). Postnonlinear ICA: s independent; here (SCA model): s is (m−1)-sparse

132 Overcomplete postnonlinear cocktail-party problem (figures)

135 Postnonlinearity identification lemma. Given an invertible 2×2 matrix A, define an L at 0 as L := A([0, ε) × {0} ∪ {0} × [0, ε)). Lemma: if a diagonal analytic diffeomorphism h := h_1 × h_2 maps an L at 0 (in general position) again onto an L at 0, then it is a linear scaling

136 Identifiability. Due to linear identifiability it is enough to show that if f(As) = f̂(Âŝ), then h = f̂⁻¹ ∘ f is a linear scaling. Case m = 2: the images of As and Âŝ are finite unions of L's, so this follows from the previous lemma

139 Identifiability: proof sketch (figure: diagram of the mixing chains R^3 → R^2 given by A, f and Â, f̂)

140 Algorithm: multistage separation: find the separating nonlinearities g; estimate the mixing matrix Â of the linearized model g(x); estimate the sources given Â and g(x). How can g be found algorithmically?

142 Postnonlinearity detection. For simplicity assume m = 2. Geometrical preprocessing: determine two 1-dimensional submanifolds in the image of x, i.e. find curves y(t) and z(t) in R^2 which are mapped onto an L by g. Simple method: choose arbitrary starting points y(t_1) and z(t_1) among the samples of x; iteratively pick the closest sample to the previous y(t_{i−1}) resp. z(t_{i−1}) with smaller modulus

145 Postnonlinearity detection (figures: the mixing f ∘ A, the resulting mixture density, and the geometrical preprocessing step)

149 Postnonlinearity detection. Reparametrization (ȳ := y ∘ y_1⁻¹) of the curves gives y_1 = z_1 = id; hence g ∘ y = (g_1, a g_1) and g ∘ z = (g_1, b g_1), so g_2 ∘ y_2 = a g_1 = (a/b) g_2 ∘ z_2. Analytical geometrical postnonlinearity detection: find an analytic 1-d diffeomorphism g with g ∘ y = c g ∘ z for c ≠ 0, ±1 and given curves y, z : (−1, 1) → R with y(0) = z(0) = 0; note c = y′(0)/z′(0)

152 Postnonlinearity detection. The equation g ∘ y = c g ∘ z can be solved in different ways: calculate composite derivatives using Faà di Bruno's formula, where the derivatives of y and z lead to estimates of the derivatives of g; least-squares polynomial fit of g using the energy function E = (1/2T) Σ_{i=1}^T (g(y(t_i)) − c g(z(t_i)))²; MLP approximation of g using the same E; fix g(0) = 0 and g′(0) = 1

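A sketch of the least-squares polynomial variant: parametrize g(u) = u + a_2 u² + ... + a_d u^d, so that g(0) = 0 and g′(0) = 1 hold by construction, and minimize E in the free coefficients, which is an ordinary linear least-squares problem. The degree is an illustrative choice.

    import numpy as np

    def fit_g_polynomial(y, z, c, degree=5):
        """Fit g(u) = u + a_2 u^2 + ... + a_d u^d such that g(y_i) ≈ c * g(z_i) (least squares)."""
        powers = np.arange(2, degree + 1)
        M = y[:, None] ** powers[None, :] - c * z[:, None] ** powers[None, :]
        rhs = -(y - c * z)
        a, *_ = np.linalg.lstsq(M, rhs, rcond=None)
        coeffs = np.concatenate(([0.0, 1.0], a))      # enforces g(0) = 0 and g'(0) = 1
        return np.polynomial.Polynomial(coeffs)       # callable estimate of g

    # g = fit_g_polynomial(y_samples, z_samples, c=y_prime0 / z_prime0)
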
154 Artificial mixtures. Example: postnonlinear mixture of n = 3 uniform sources (10^5 samples) to m = 2 observations; postnonlinear mixing model x = (f_1 × f_2)(As) with a fixed 2×3 mixing matrix A and postnonlinearities f_1(x) = tanh(x) + 0.1x and f_2(x) = x. Algorithm: MLP-based postnonlinearity detection with natural gradient-descent learning; parameters: 9 hidden neurons, learning rate η = 0.1 and 10^5 iterations

156 PNL detection (figures: mixing postnonlinearities f_1, f_2 and estimated separating postnonlinearities g_1, g_2; density of the recovered sources); SIRs of 26, 71 and 46 dB

159 Summary. Analyze statistical patterns in data sets x(t); method: factorization model x(t) = f(s(t)). Supervised training of f: nearest neighbour (local), regression (global). Unsupervised identification (often linear): clustering (local model), blind source separation (linear model). Applications: biomedical data analysis, signal processing, financial markets etc.

160 Current application (with T. Schröder, HMGU): unsupervised clustering of subtrees; supervised learning of cell shapes; parameter estimation of a dynamical system for cell fate decision
