Biomedical signal processing application of optimization methods for machine learning problems


1 Biomedical signal processing: application of optimization methods for machine learning problems. Fabian J. Theis, Computational Modeling in Biology, Institute of Bioinformatics and Systems Biology, Helmholtz Zentrum München. Grenoble, 16-Sep-2008

2 Data mining: the cocktail-party problem (figures: several speakers recorded by several microphones; a neural network W is used to recover the individual signals from the mixtures)

6 Data mining: mixture model x(t) = f(s(t)); estimate the mixing process f and the sources s(t); often f is linear, x(t) = A s(t); a neural network W produces source estimates ŝ(t) from the mixtures x(t)

7 Outline: 1. Supervised methods (Motivation 1: classification; Motivation 2: image segmentation; statistical decision theory); 2. Clustering (k-means; partitional clustering); 3. Independent component analysis (sparse component analysis; nonlinear sparse component analysis)

8 Section 1: Supervised methods (Motivation 1: classification; Motivation 2: image segmentation; statistical decision theory)

9 Motivation 1: classification. Data analysis task: decide between two (or multiple) classes, s(t) ∈ {0, 1}; learn by example from labelled training data

10 Neural networks (figure)

11 Classification: example. Observations: an immunological data set, 3 cell parameters of 37 children with pulmonary diseases. Goal: interpretation using supervised and unsupervised analysis; disease classification into chronic bronchitis (CB) or interstitial lung disease (ILD). Cooperation with D. Hartl, Pediatric Immunology, Munich

13 Data visualization & dimension reduction: parameter interpretation?

14 Data visualization & dimension reduction (figure: self-organizing map component planes with CB/ILD cluster labels): visualization by a self-organizing map network; topology-preserving nonlinear dimension reduction/scaling; detect new parameter dependencies

15 Disease classification. Dimension-reducing network z(i) = B_supervised A_unsup. x(i). Results: down-scaling to 5 hidden neurons suffices; classification rate of > 90%. [Theis, Hartl, Krauss-Etschmann, Lang. Neural network signal analysis in immunology. Proc. ISSPA 2003]

17 Motivation 2: image segmentation. A classification application in image processing: object classification

18 Problem: how many labelled cells lie in this section image? (figure)

19 Biological background: adult neurogenesis. New neurons emerge even in the adult human brain; the level depends on external stimuli. Are there neural ancestral cells? Goal: automated quantification of neurogenesis in adult mice. Cooperation with Z. Kohl, Department of Neurology, University of Regensburg

20 Automated cell counting (figure)

21 Automated cell counting with a directional neural network: train a cell-patch classifier ζ using a directional neural network; scan the image using ζ to get the cell positions; speed-up via hierarchical and multiscale methods

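To make the scanning step concrete, here is a minimal Python sketch (not the original ZANE implementation): it assumes a hypothetical patch classifier classify_patch that returns a cell probability for a square image patch, slides it over the section image and keeps thresholded local maxima as candidate cell positions.

    import numpy as np

    def scan_image(image, classify_patch, patch=15, stride=2, thresh=0.9):
        """Slide a trained patch classifier over the image and return candidate cell positions."""
        h, w = image.shape
        half = patch // 2
        score = np.zeros((h, w))
        for r in range(half, h - half, stride):
            for c in range(half, w - half, stride):
                window = image[r - half:r + half + 1, c - half:c + half + 1]
                score[r, c] = classify_patch(window)      # hypothetical classifier (e.g. an MLP)
        positions = []                                    # crude non-maximum suppression
        for r in range(half, h - half):
            for c in range(half, w - half):
                if score[r, c] > thresh and score[r, c] == score[r-2:r+3, c-2:c+3].max():
                    positions.append((r, c))
        return positions
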
23 Results: comparison with 2 experts (inter-expert variability ±5%) yields 90% ± 4% counting accuracy. Application: considerable cell proliferation in the hippocampus of epileptic mice. [Theis, Kohl, Guggenberger, Kuhn, Lang. ZANE - an algorithm for counting labelled cells in section images. Proc. MEDSIP 2004]

24 Statistical decision theory: setup. Input: random vector X : Ω → R^p. Output: random variable Y : Ω → R, or a categorical output, possibly Y ∈ {0, 1}. The input-output relation is measured by the joint density P(X, Y); it is realized by samples (training data) (x_i, y_i) for i = 1, ..., N, often collected in an (N × p)-matrix X and a vector y ∈ R^N

25 Goal: prediction. Learn a classifier from the training data and predict y for a new sample x

26 Linear model: ŷ = β̂_0 + Σ_{j=1}^p x_j β̂_j; setting x_0 := 1 gives ŷ = xᵀβ̂. Least squares: minimize RSS(β) = Σ_{i=1}^N (y_i − x_iᵀβ)² = (y − Xβ)ᵀ(y − Xβ); setting the gradient Xᵀ(y − Xβ) to zero yields β̂ = (XᵀX)⁻¹ Xᵀ y

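As a quick numerical illustration of the closed form (a sketch with synthetic data, not part of the original slides): the normal equations can be solved directly, although a QR/SVD-based routine such as numpy.linalg.lstsq is numerically preferable to forming (XᵀX)⁻¹.

    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 100, 3
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])   # first column x_0 := 1
    beta_true = np.array([0.5, 1.0, -2.0, 0.3])                 # illustrative coefficients
    y = X @ beta_true + 0.1 * rng.normal(size=N)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)                # normal equations
    beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)          # numerically safer equivalent
    print(beta_hat, beta_lstsq)
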
30 Linear model: decision boundary {x | xᵀβ̂ = 1/2}

31 Linear model: nice, but what about more complex data? (r = 2 and r = 1 Gaussians per class, σ = 0.2, with the r means sampled from N((1, 0), I) and N((0, 1), I), respectively)

32 Linear model: hm? The global, linear model is too rigid

33 Nearest-neighbour method: ŷ(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i, where N_k(x) denotes the k closest points x_i to x. A local model; needs a metric (here Euclidean). How to determine k? Smaller k: higher learning accuracy; larger k: smoother fit, higher generalizability; least-squares learning on the training error would yield k = 1

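A minimal sketch of the k-nearest-neighbour estimate from the slide above, assuming the training data are given as an array Xtrain with labels ytrain (names and k are illustrative only).

    import numpy as np

    def knn_predict(Xtrain, ytrain, x, k=15):
        """yhat(x) = average of y_i over the k nearest neighbours N_k(x) (Euclidean metric)."""
        dist = np.linalg.norm(Xtrain - x, axis=1)
        neighbours = np.argsort(dist)[:k]
        return ytrain[neighbours].mean()

    # classify: label = 1 if knn_predict(Xtrain, ytrain, xnew) > 0.5 else 0
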
35 Nearest-neighbour method: decision boundary {x | ŷ(x) = 1/2}

36 Nearest-neighbour method, k = 1, 2, ... (figures: decision boundaries for varying k)

37 Statistical decisions. Probabilistic view: P(X, Y) = P(Y | X) P(X). Find a function f(X) predicting Y as well as possible w.r.t. the squared error loss L(Y, f(X)) = (Y − f(X))². Expected prediction error: EPE(f) = E(Y − f(X))² = ∫ (y − f(x))² P(dx, dy) = E_X E_{Y|X}((Y − f(X))² | X). Pointwise minimization suffices: f(x) = argmin_c E_{Y|X}((Y − c)² | X = x), solved by the conditional expectation (regression function) f(x) = E(Y | X = x)

39 Statistical decisions. f(x) = E(Y | X = x) can be estimated by f̂(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i: approximate the expectation via sample averages, and approximate conditioning at a point by conditioning on a local neighbourhood. Note: f̂(x) → E(Y | X = x) for N, k → ∞ with k/N → 0. But with (very) finite samples we face the curse of dimensionality: the fraction r of the unit cube in p dimensions is covered by a subcube of edge length e_p(r) = r^{1/p}; e.g. e_2(0.01) = 0.1, e_2(0.1) ≈ 0.32, e_10(0.01) ≈ 0.63, e_10(0.1) ≈ 0.8

42 Statistical decisions. If, instead of approximating f(x) = E(Y | X = x) locally, we assume the linear model f(x) = xᵀβ, we get β = E(XXᵀ)⁻¹ E(XY): no conditioning, a global approximation

44 Statistical decisions for discrete Y. If Y ∈ {0, 1}, consider the 0-1 loss L(Y, f(X)) = 0 if f(X) = Y and 1 otherwise. Then EPE = E_X Σ_{y ∈ {0,1}} L(y, f(X)) P(y | X), and hence Ŷ(x) = argmin_{y' ∈ {0,1}} Σ_{y ∈ {0,1}} L(y, y') P(y | X = x) = argmin_{y' ∈ {0,1}} (1 − P(y' | X = x)), which yields the Bayes classifier Ŷ(x) = argmax_y P(y | X = x). Question: how to model P(Y | X)?

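A small sketch of the Bayes classifier for the two-class case, assuming (unrealistically) that the class-conditional densities and priors are known; modelling P(Y | X) from data is exactly the open question on the slide. The Gaussian parameters below are illustrative, not values from the talk.

    import numpy as np
    from scipy.stats import multivariate_normal

    # assumed (known) generative model: one Gaussian per class, equal priors
    densities = {0: multivariate_normal(mean=[1.0, 0.0], cov=0.2 * np.eye(2)),
                 1: multivariate_normal(mean=[0.0, 1.0], cov=0.2 * np.eye(2))}
    prior = {0: 0.5, 1: 0.5}

    def bayes_classify(x):
        """Yhat(x) = argmax_y P(y | X = x), i.e. argmax_y p(x | y) P(y)."""
        posterior = {y: densities[y].pdf(x) * prior[y] for y in (0, 1)}
        return max(posterior, key=posterior.get)
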
47 Bayes classifier: results (figure)

48 Method combinations. Nonlinear models, e.g. additive models f(x) = Σ_{j=1}^p f_j(x_j), or basis expansions f(x) = Σ_j h_j(x) β_j with polynomial, Fourier or sigmoidal bases (neural networks); prediction/function approximation by maximum-likelihood estimation of the parameters; enhance generalizability by adding a regularization term +λ J(f) to RSS(f), for f from some function class; generalize inner-product methods to nonlinear situations via a high-dimensional embedding x ↦ Φ(x) and kernels k(x, x') = Φ(x)ᵀΦ(x')

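A sketch combining two of these ingredients for scalar inputs: a polynomial basis expansion h_j(x) = x^j and a quadratic regularization term λ‖β‖² added to RSS (ridge regression in the expanded basis). Degree and λ are illustrative choices, not values from the talk.

    import numpy as np

    def poly_basis(x, degree=5):
        """Basis expansion h_j(x) = x^j, j = 0..degree, for scalar inputs."""
        return np.vander(np.asarray(x, dtype=float), degree + 1, increasing=True)

    def fit_ridge_basis(x, y, degree=5, lam=1e-2):
        """Minimize RSS(beta) + lam * ||beta||^2 in the expanded basis."""
        H = poly_basis(x, degree)
        return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)

    def predict(beta, x, degree=5):
        return poly_basis(x, degree) @ beta
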
49 Section 2: Clustering (k-means; partitional clustering)

50 Clustering, explained by example. Goal: differentiate hand-written digits 2 and 4; given a set of unlabelled gray-scale images of 2s and 4s, find the subset of 2s and the subset of 4s (like a baby would): unsupervised learning by example

53 Example data set. Here: machine learning, i.e. a statistical approach, needs many test cases; a set of 28x28 gray-scale images of each digit; interpret each 28x28 image as an element of R^784; dimension reduction via PCA to only 2 dimensions

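A minimal sketch of the preprocessing described above, assuming the digit images are given as an array of shape (N, 28, 28): flatten each image into R^784, centre the data and project onto the first two principal components.

    import numpy as np

    def pca_2d(images):
        """Project (N, 28, 28) images onto their first two principal components."""
        X = images.reshape(len(images), -1).astype(float)   # each image as an element of R^784
        Xc = X - X.mean(axis=0)                             # centre the data
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # right singular vectors = PCs
        return Xc @ Vt[:2].T                                # (N, 2) coordinates for plotting
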
58 k-means clustering: data vectors (samples) x(1), x(2), ..., x(T) ∈ R^n and a distance measure d(x, y) between samples. Algorithm (k-means): given the number k of clusters, initialize the centroids randomly; update rules: batch or sequential (online). Cost function: minimize E(c_i, C_i) := Σ_{i=1}^k (1/|C_i|) Σ_{x ∈ C_i} d(x, c_i)². (Figures illustrate the batch update: partitioning, assignment, centroid update; and the sequential update: pick an arbitrary sample, find its nearest centroid, update that centroid.) [Theis, Gruber. Grassmann clustering. Proc. EUSIPCO 2006]

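A compact batch k-means sketch for the Euclidean case (illustrative only, not the Grassmann variant of the cited paper): it alternates the assignment and centroid-update steps described above until the centroids stop moving.

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        """Batch k-means with Euclidean distance: alternate assignment and centroid update."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initialization
        for _ in range(iters):
            # assignment step: nearest centroid for every sample
            d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # update step: centroid = mean of its cluster (Euclidean case, cf. slide 80)
            new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
                            for i in range(k)])
            if np.allclose(new, centroids):
                break
            centroids = new
        return centroids, labels
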
68 Batch k-means on the digit data (figures after 1 to 7 iterations); the algorithm converges with a final error of 4.5%

75 Partitional clustering. Goal: given a set A = {a_1, ..., a_T} of points in a metric space (M, d), find a partition of A into clusters B_i with ∪_i B_i = A, and centroids c_i ∈ M, minimizing E(B_1, c_1, ..., B_k, c_k) := Σ_{i=1}^k Σ_{a ∈ B_i} d(a, c_i)². (1) Equivalently, a constrained nonlinear optimization problem: minimize E(W, C) := Σ_{i=1}^k Σ_{t=1}^T w_it d(a_t, c_i)² (2) subject to w_it ∈ {0, 1} and Σ_{i=1}^k w_it = 1 for 1 ≤ i ≤ k, 1 ≤ t ≤ T, (3) where C := {c_1, ..., c_k} are the centroid locations and W := (w_it) is the partition matrix

77 Minimize this! Common approach: partial optimization for W and C, i.e. alternate minimization of W and C while keeping the other fixed. Batch k-means algorithm: initial random choice of centroids c_1, ..., c_k; iterate until convergence: cluster assignment: for each a_t determine an index i(t) = argmin_i d(a_t, c_i); cluster update: within each cluster B_i := {a_t | i(t) = i}, determine the centroid c_i := argmin_c Σ_{a ∈ B_i} d(a, c)². Convergence to a local minimum (not necessarily the global one)

80 Euclidean case. Special case: M := R^n with the Euclidean distance d(x, y) := ‖x − y‖. The centroids can then be calculated in closed form: the centroid is the cluster mean c_i := (1/|B_i|) Σ_{a ∈ B_i} a. This follows directly from Σ_{a ∈ B_i} ‖a − c_i‖² = Σ_{a ∈ B_i} Σ_{j=1}^n (a_j − c_ij)² = Σ_{j=1}^n Σ_{a ∈ B_i} (a_j² − 2 a_j c_ij + c_ij²), which is minimized coordinate-wise by the mean

83 Extensions: c_i := argmin_c Σ_{a ∈ B_i} d(a, c)^p leads to more difficult optimization problems: non-Euclidean spaces, e.g. RP^n or Grassmann manifolds; extensions from p = 2 to e.g. p = 1 or other exponents p; p = 1 corresponds to finding the spatial median of B_i

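For p = 1 the centroid update has no closed form; a standard choice is Weiszfeld's iteration for the spatial median, sketched below (the guard eps sidesteps the degenerate case where an iterate coincides with a data point).

    import numpy as np

    def spatial_median(A, iters=100, eps=1e-9):
        """Weiszfeld iteration for argmin_c sum_{a in A} ||a - c|| (rows of A are the points)."""
        c = A.mean(axis=0)                        # start from the ordinary mean
        for _ in range(iters):
            d = np.linalg.norm(A - c, axis=1)
            w = 1.0 / np.maximum(d, eps)          # inverse-distance weights
            c_new = (A * w[:, None]).sum(axis=0) / w.sum()
            if np.linalg.norm(c_new - c) < 1e-10:
                break
            c = c_new
        return c
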
85 Section 3: Independent component analysis (sparse component analysis; nonlinear sparse component analysis)

86 Independent component analysis. Example: the cocktail-party problem of the brain (auditory cortex, word detection, decision). [Keck, Theis, Gruber, Lang, Specht, Puntonet. 3D spatial analysis of fMRI data on a word perception task. LNCS 3195]

87 BSS model. Blind source separation (BSS) problem: x(t) = A s(t) + ε(t), where x(t) is the observed m-dimensional random vector, A is an (unknown) full-rank m × n matrix, s(t) are the (unknown) n-dimensional source signals (here n ≤ m), and ε(t) is (unknown) white noise. Goal: given x, recover A and s! Additional assumptions are necessary: stochastically independent s(t), p_s(s_1, ..., s_n) = p_{s_1}(s_1) ··· p_{s_n}(s_n): independent component analysis (ICA); sparse source signals s_i(t): sparse component analysis (SCA); nonnegative s and A: nonnegative matrix factorization (NMF)

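A sketch of the generative BSS model on this slide with toy sources; the source distributions, the mixing matrix and the noise level are illustrative only, and x is the only quantity a BSS algorithm gets to see.

    import numpy as np

    rng = np.random.default_rng(0)
    T = 10000
    s = np.vstack([rng.laplace(size=T),            # two independent, non-Gaussian sources
                   rng.uniform(-1, 1, size=T)])
    A = np.array([[1.0, 0.6],
                  [0.4, 1.0]])                     # unknown full-rank mixing matrix (m = n = 2)
    eps = 0.01 * rng.normal(size=(2, T))           # white noise
    x = A @ s + eps                                # the mixtures: the only observed quantity
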
89 Important questions in data analysis: the model? (restrictions on A and s); the indeterminacies of the model? algorithmic identification given x? Identifiability: the obvious indeterminacies are scaling L and permutation P. Theorem: let the independent random vector s ∈ L² contain at most one Gaussian component; given two ICA solutions As = Ãs̃, then Ã = ALP. Note: the theorem does not hold for Gaussian sources s. [Theis. A new concept for separability problems in blind source separation. Neural Computation, 2004]

93 ICA algorithms. Basic scheme (case m = n): search for an invertible demixing matrix W that minimizes some dependence measure of Wx. Some contrasts: minimize the mutual information I(Wx); maximize the neural-network output entropy H(f(Wx)); extend PCA by performing nonlinear decorrelation; maximize the non-Gaussianity of the output components (Wx)_i; minimize the off-diagonal error of the Hessian of ln p_{Wx}; minimize the median deviation of Wx. [Theis et al. Linear geometric ICA: fundamentals and algorithms. Neural Computation, 2003] [Theis, Lang, Puntonet. A geometric algorithm for overcomplete linear ICA. Neurocomputing, 2004]

95 Optimization problem: minimize a cost function f(W) on Gl(n) or O(n). Often: gradient descent W ← W − η ∇f(W); in high dimensions: simulated annealing or genetic algorithms. Use the non-Euclidean structure of Gl(n): the Euclidean gradient is not compatible with the group Gl(n); define the natural gradient ∇_nat f(W) = ∇_euc f(W) WᵀW: considerable performance increase. [Stadlthanner, Theis, Puntonet, Lang. Extended sparse nonnegative matrix factorization. LNCS 3512] [Squartini, Theis. New Riemannian metrics for speeding-up the convergence of over- and underdetermined ICA. In preparation] [Theis. Gradients on matrix manifolds and their chain rule. Submitted, 2005]

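A sketch of one classical natural-gradient ICA update (maximum-likelihood/Infomax style with a tanh score, which implicitly assumes super-Gaussian sources and centred mixtures); multiplying the Euclidean gradient by WᵀW turns the update into a simple multiplicative rule without matrix inversion. Learning rate and iteration count are illustrative.

    import numpy as np

    def natural_gradient_ica(x, iters=200, eta=0.1):
        """x: (n, T) centred mixtures; returns a demixing matrix W, up to scaling and permutation."""
        n, T = x.shape
        W = np.eye(n)
        for _ in range(iters):
            y = W @ x
            # natural-gradient update dW = eta * (I - E[tanh(y) y^T]) W
            W = W + eta * (np.eye(n) - (np.tanh(y) @ y.T) / T) @ W
        return W
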
97 fMRI analysis (spatial-only BSS). Functional magnetic resonance imaging: a noninvasive brain-imaging technique; provides information on brain activation patterns; activation maps help identify task-related brain regions. BSS techniques are applicable to fMRI

99 Experimental setup. Experiment: block-design protocol with 5 time instants of visual stimulation alternating with 5 instants of rest; 100 scans taking 3 s each. Data set: well-known design with expected activity in the visual cortex; here only a single horizontal slice is used. Preprocessing: motion correction, smoothing. Data acquired by D. Auer, MPI of Psychiatry, Munich

100 Results (figures: (a) spatial sources s^S, (b) temporal sources t^S, with per-component crosscorrelations cc): component 2 partially represents the frontal eye fields; component 4 is the stimulus component, with cc = 0.9 against the stimulus time course. [Theis, Gruber, Keck, Lang. Functional MRI analysis by a novel spatiotemporal ICA algorithm. LNCS 3696]

102 Why extend ICA? The identifiability of ICA only holds if the data follows the generative model with independent sources. Simulation: apply ICA to data not fulfilling the ICA model; here the sources consist of a 2-d and a 1-d irreducible component; plot the Amari (crosstalking) error over 100 runs for FastICA, JADE and Extended Infomax. Result: no recovery of the mixing matrix

104 Require stochastic independence only between groups of source components: an nk-dimensional S is k-independent if the k-dimensional subvectors (S_1, ..., S_k), ..., (S_{nk−k+1}, ..., S_{nk}) are mutually independent: independent subspace analysis (ISA). Recent result: extension to arbitrary group sizes; major advantage: a general independent subspace analysis (ISA) always exists. [Theis. Uniqueness of complex and multidimensional independent component analysis. Signal Processing, 2004]

106 Schematic comparison (figures): PCA, ICA, ISA with fixed group size, and general ISA all factorize X = AS; the ICA and ISA solutions are unique only up to scaling L and permutation P (block-wise for ISA)

110 ISA framework. Definition: Y is an independent component of X if X = A(Y, Z) such that Y and Z are stochastically independent. Definition (general ISA): S is irreducible if it contains no lower-dimensional independent component; W ∈ Gl(n) is an independent subspace analysis of X if WX = (S_1, ..., S_k) with pairwise independent, irreducible S_i. Theorem: given a random vector X with existing covariance, an ISA of X exists and is unique except for scaling and permutation

113 Algebraic ISA algorithms. Main idea: source condition matrices C_i(S) are block-diagonal. Subspace JADE: after whitening, assume an orthogonal A; group-independence of S implies that the contracted quadricovariance matrices C_ij(S) = Aᵀ C_ij(X) A are block-diagonal; perform joint block diagonalization of {C_ij(X)} to recover A; for general ISA, estimate the block structure after diagonalization. [Theis. Towards a general independent subspace analysis. Proc. NIPS 2006]

115 Joint block diagonalization with unknown block sizes. Joint block diagonalization (JBD): given n × n matrices C_1, ..., C_K and a partition m_1 + ... + m_r = n of n, find an orthogonal A such that for all k, Aᵀ C_k A is m-block-diagonal. Minimize (e.g. by applying iterative Givens rotations) f_m(Â) := Σ_{k=1}^K ‖Âᵀ C_k Â − diag_m(Âᵀ C_k Â)‖²_F. Unknown block size m: general JBD then searches for the partition of maximal length admitting an exact block diagonalizer, i.e. (A, m) = argmax_{m : ∃A with f_m(A) = 0} |m|. Result (JBD by JD): any block-optimal JBD, i.e. a zero of f_m, is a local minimum of ordinary joint diagonalization

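A sketch of the block-diagonality criterion f_m only (the Givens-rotation minimization itself is not shown): given a candidate Â, the matrices C_k and a block partition m, it sums the squared Frobenius norm of everything outside the diagonal blocks of Âᵀ C_k Â.

    import numpy as np

    def block_mask(m):
        """Boolean mask that is True inside the diagonal blocks of sizes m = (m_1, ..., m_r)."""
        n = sum(m)
        mask = np.zeros((n, n), dtype=bool)
        start = 0
        for size in m:
            mask[start:start + size, start:start + size] = True
            start += size
        return mask

    def jbd_cost(A_hat, Cs, m):
        """f_m(A_hat) = sum_k || off-block part of A_hat^T C_k A_hat ||_F^2."""
        off = ~block_mask(m)
        return sum(np.linalg.norm((A_hat.T @ C @ A_hat)[off]) ** 2 for C in Cs)
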
119 Example: performance of the proposed general JBD on an (unknown) block partition of n = 4, with additive noise at an SNR of 5 dB and a set of K matrices C_k (figures: ÂᵀA without and with permutation recovery). Result: the estimate Â equals A after permutation recovery

120 Extraction of fetal electrocardiograms: separate the fetal ECG (FECG) from the mother's ECG (MECG) in the recordings; apply Hessian-based MICA with k = 2 and 5 Hessians

121 Figures: (a) ECG recordings, (b) extracted sources, (c) MECG part, (d) FECG part

122 Sparse component analysis. [Theis, Puntonet, Lang. Median-based clustering for underdetermined blind signal processing. IEEE SPL, 2005]

123 Model. Sparse component analysis (SCA) problem: x(t) = A s(t), with observed mixtures x(t) ∈ R^m, A an (unknown) real matrix with linearly independent columns, and s(t) ∈ R^n the (unknown) (m−1)-sparse sources, i.e. s(t) has at most m−1 non-zeros. Goal: recover the unknown A and s(t) given only x(t). Theorem: if s(t) is (m−1)-sparse and A and s(t) are in general position, both A and s(t) are identifiable (except for scaling and permutation). [Georgiev, Theis, Cichocki. Sparse component analysis and blind source separation of underdetermined mixtures. IEEE TNN, 2005]

125 SCA algorithm: matrix identification by multiple hyperplane detection, e.g. using a generalized Hough transform, robust against outliers and noise; source recovery using sparsity and the identified matrix. [Theis, Georgiev, Cichocki. Robust sparse component analysis based on a generalized Hough transform. Signal Processing, 2006]

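A sketch of the source-recovery step for the simplest case m = 2, where each s(t) is at most 1-sparse: with the (estimated) mixing matrix given, every observation x(t) is assigned to the closest data line, i.e. the best-matching column of A, and the corresponding coefficient is read off. The hyperplane/Hough identification step is not reproduced here.

    import numpy as np

    def recover_1sparse(x, A):
        """x: (2, T) mixtures, A: (2, n) estimated mixing matrix. Returns (n, T) 1-sparse sources."""
        A = A / np.linalg.norm(A, axis=0)            # normalize columns (scaling indeterminacy)
        n, T = A.shape[1], x.shape[1]
        s = np.zeros((n, T))
        coeff = A.T @ x                              # projection of each x(t) onto every column
        resid = np.linalg.norm(x[:, None, :] - A[:, :, None] * coeff[None, :, :], axis=0)
        best = resid.argmin(axis=0)                  # column (data line) closest to x(t)
        s[best, np.arange(T)] = coeff[best, np.arange(T)]
        return s
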
126 SCA of surface electromyograms. Electromyogram (EMG): the electric signal generated by a contracting muscle; surface EMG: non-invasive, but the sources overlap. Cooperation with G. García, Bioinformatic Engineering, Osaka

127 Results: source and SCA recovery within 8 artificial, dependent mixtures. Results on toy data: sparseness works as a separation criterion. Real data: relative sEMG enhancement of 24.6 ± 21.4% (mean over 9 subjects), beating standard signal processing and ICA. [Theis, García. On the use of sparse signal decomposition in the analysis of multi-channel surface EMGs. Signal Processing, 2006]

129 SCA of functional MRI data (figures: component maps S and time courses A with per-component crosscorrelations cc). The complete SCA was performed using k-means hyperplane clustering; components 2 and 3 represent the inner ventricles, component 1 contains the frontal eye fields; component 5 is the desired visual stimulus component, active in the visual cortex (crosscorrelation with the stimulus cc = 0.88; FastICA yields a similar cc = 0.9)

131 Postnonlinear SCA. Given an m-dimensional random vector x, find a representation x = f(As) with an unknown n-dimensional random vector s (sources), an m × n matrix A (mixing matrix), and a diagonal invertible f = f_1 × ... × f_m (postnonlinearities). Postnonlinear ICA: s independent; here (SCA model): s is (m−1)-sparse

132 Overcomplete postnonlinear cocktail-party problem (figures)

135 Postnonlinearity identification lemma. Given an invertible 2×2 matrix A, define an L at 0 as L := A([0, ε) × {0} ∪ {0} × [0, ε)). Lemma: if a diagonal analytic diffeomorphism h := h_1 × h_2 maps an L at 0 (in general position) again onto an L at 0, then it is a linear scaling

136 Identifiability. Due to linear identifiability it is enough to show that if f(As) = f̂(Âŝ), then h = f̂⁻¹ ∘ f is a linear scaling. Case m = 2: the images of As and Âŝ are finite unions of L's, so this follows from the previous lemma

139 Identifiability: proof sketch (figure: diagram of the mixing chains R^3 → R^2 given by A, f and Â, f̂)

140 Algorithm: multistage separation: find the separating nonlinearities g; estimate the mixing matrix Â of the linearized model g(x); estimate the sources given Â and g(x). How can g be found algorithmically?

142 Postnonlinearity detection. For simplicity assume m = 2. Geometrical preprocessing: determine two 1-dimensional submanifolds in the image of x, i.e. find curves y(t) and z(t) in R^2 which are mapped onto an L by g. Simple method: choose arbitrary starting points y(t_1) and z(t_1) among the samples of x; iteratively pick the closest sample to the previous y(t_{i−1}) resp. z(t_{i−1}) with smaller modulus

145 Postnonlinearity detection (figures: the mixing f ∘ A, the resulting mixture density, and the geometrical preprocessing step)

149 Postnonlinearity detection. Reparametrization (ȳ := y ∘ y_1⁻¹) of the curves gives y_1 = z_1 = id; hence g ∘ y = (g_1, a g_1) and g ∘ z = (g_1, b g_1), so g_2 ∘ y_2 = a g_1 = (a/b) g_2 ∘ z_2. Analytical geometrical postnonlinearity detection: find an analytic 1-d diffeomorphism g with g ∘ y = c g ∘ z for c ≠ 0, ±1 and given curves y, z : (−1, 1) → R with y(0) = z(0) = 0; note c = y′(0)/z′(0)

152 Postnonlinearity detection. The equation g ∘ y = c g ∘ z can be solved in different ways: calculate composite derivatives using Faà di Bruno's formula, where the derivatives of y and z lead to estimates of the derivatives of g; least-squares polynomial fit of g using the energy function E = (1/2T) Σ_{i=1}^T (g(y(t_i)) − c g(z(t_i)))²; MLP approximation of g using the same E; fix g(0) = 0 and g′(0) = 1

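A sketch of the least-squares polynomial variant: parametrize g(u) = u + a_2 u² + ... + a_d u^d, so that g(0) = 0 and g′(0) = 1 hold by construction, and minimize E in the free coefficients, which is an ordinary linear least-squares problem. The degree is an illustrative choice.

    import numpy as np

    def fit_g_polynomial(y, z, c, degree=5):
        """Fit g(u) = u + a_2 u^2 + ... + a_d u^d such that g(y_i) ≈ c * g(z_i) (least squares)."""
        powers = np.arange(2, degree + 1)
        M = y[:, None] ** powers[None, :] - c * z[:, None] ** powers[None, :]
        rhs = -(y - c * z)
        a, *_ = np.linalg.lstsq(M, rhs, rcond=None)
        coeffs = np.concatenate(([0.0, 1.0], a))      # enforces g(0) = 0 and g'(0) = 1
        return np.polynomial.Polynomial(coeffs)       # callable estimate of g

    # g = fit_g_polynomial(y_samples, z_samples, c=y_prime0 / z_prime0)
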
154 Artificial mixtures. Example: postnonlinear mixture of n = 3 uniform sources (10^5 samples) to m = 2 observations; postnonlinear mixing model x = (f_1 × f_2)(As) with a fixed 2×3 mixing matrix A and postnonlinearities f_1(x) = tanh(x) + 0.1x and f_2(x) = x. Algorithm: MLP-based postnonlinearity detection with natural gradient-descent learning; parameters: 9 hidden neurons, learning rate η = 0.1 and 10^5 iterations

156 PNL detection (figures: mixing postnonlinearities f_1, f_2 and estimated separating postnonlinearities g_1, g_2; density of the recovered sources); SIRs of 26, 71 and 46 dB

159 Summary. Analyze statistical patterns in data sets x(t); method: factorization model x(t) = f(s(t)). Supervised training of f: nearest neighbour (local), regression (global). Unsupervised identification (often linear): clustering (local model), blind source separation (linear model). Applications: biomedical data analysis, signal processing, financial markets etc.

160 Current application (with T. Schröder, HMGU): unsupervised clustering of subtrees; supervised learning of cell shapes; parameter estimation of a dynamical system for cell fate decision
