Unsupervised Feature Extraction by Time-Contrastive Learning and Nonlinear ICA
Aapo Hyvärinen, with Hiroshi Morioka
Dept of Computer Science, University of Helsinki, Finland
Facebook AI Summit, 13th June 2016
Abstract
- How to extract features from multi-dimensional data when there are no labels (unsupervised)?
- We consider data with temporal structure
- We learn features that enable discriminating data from different time segments (taking segment labels as class labels)
- We use ordinary neural networks with multinomial logistic regression: last hidden layer gives the features
- Surprising theoretical result: learns to estimate a nonlinear ICA model with general nonlinear mixing $x(t) = f(s(t))$ and nonstationary components $s_i(t)$
Background: Need for generative models like ICA
- Unsupervised deep learning is a largely unsolved problem
- Important since labels are often difficult (costly) to obtain
- Most approaches are heuristic; not very clear what they are doing
- Best would be to define a generative model, and estimate it
- Cf. linear unsupervised learning: independent component analysis (ICA) / sparse coding: generative models which are well-defined, i.e. identifiable (Darmois-Skitovich around 1950; Comon, 1994)
- If we define and estimate generative models:
  - we know better what we are doing
  - we can use all the theory of probabilistic methods...
  - but admittedly, it is theoretically more challenging
Background: Nonlinear ICA may not be well-defined
- For a random vector x, it is easy to assume a nonlinear generative model
  $$x = f(s) \qquad (1)$$
  with mutually independent hidden/latent components $s_i$.
- However, not identifiable, i.e. many different nonlinear transforms of x give independent components: no guarantee we can recover the original $s_i$
  - if we assume data with no temporal structure, and general smooth invertible nonlinearities f (Darmois, 1952; Hyvärinen and Pajunen, 1999)
- Nevertheless, estimation attempted by many authors, e.g. Tan-Zurada (2001), Almeida (2003), and recent deep learning work (Dinh et al., 2015)
Background: Temporal correlations can help
- Harmeling et al. (2003) suggested using temporal structure:
  - find features that change as slowly as possible (Földiák, 1991)
  - they used kernel-based models of nonlinearities
- Well-known idea in the linear ICA (source separation) literature (Tong et al., 1991; Belouchrani et al., 1997)
- In the linear case, identifiable if autocorrelations are distinct for different sources (a rather strict condition!)
- In the nonlinear case, identifiability unknown, but certainly not better than in the linear case!
Background: Temporal structure as nonstationarity
- A less-known principle in linear source separation: sources are nonstationary (Matsuoka et al., 1995)
- Usually, we assume the variances of the sources change in time:
  $$s_i(t) \sim \mathcal{N}(0, \sigma_i(t)^2) \qquad (2)$$
  (a toy numerical sketch of this model follows below)
- The linear model x(t) = As(t) is identifiable under weak assumptions (Pham and Cardoso, 2001)
- So far, not used in the nonlinear case...
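Below is a minimal numpy sketch of the segment-wise variance model (2); the numbers of sources and segments, the segment length, and the uniform range for the standard deviations are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sources, n_segments, seg_len = 4, 64, 512   # illustrative sizes

# Segment-wise standard deviations sigma_i(tau); the uniform range is arbitrary
sigmas = rng.uniform(0.2, 2.0, size=(n_segments, n_sources))

# s_i(t) ~ N(0, sigma_i(tau)^2) within segment tau, as in Eq. (2)
s = np.concatenate([sig * rng.standard_normal((seg_len, n_sources))
                    for sig in sigmas])
labels = np.repeat(np.arange(n_segments), seg_len)   # segment index per time point
```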
Time-contrastive learning (TCL): Intuitive motivation
- Assume we are given an n-dimensional time series x(t), with t a time index
- Divide the time series (arbitrarily) into k segments (e.g. bins of equal size, 100 to 1000 points in each segment)
- Train a multi-layer perceptron to discriminate between segments (see the sketch below):
  - number of classes is k; the index of the segment is the class label
  - use multinomial regression, well-known algorithms/software
- The classifier should find a good representation in its hidden layers, in particular regarding nonstationarity
- Turns unsupervised learning into supervised learning, cf. noise-contrastive estimation or generative adversarial nets
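Continuing the toy data above, here is a hedged sketch of this recipe, using scikit-learn's MLPClassifier as the "ordinary MLP with multinomial regression"; the random two-layer mixing and all network sizes are placeholder assumptions (the stand-in mixing is not guaranteed invertible, unlike in the theory).

```python
from sklearn.neural_network import MLPClassifier

# Stand-in nonlinear mixing x(t) = f(s(t)): a random two-layer map
W1 = rng.standard_normal((n_sources, 16))
W2 = rng.standard_normal((16, n_sources))
x = np.tanh(s @ W1) @ W2

# TCL: an ordinary MLP classifier = multinomial logistic regression on top
# of hidden layers; the segment index serves as the class label
clf = MLPClassifier(hidden_layer_sizes=(32, n_sources), activation="relu",
                    max_iter=300, random_state=0).fit(x, labels)

def features(mlp, X):
    """Outputs of the last hidden layer, h(x(t)), of a fitted MLPClassifier."""
    a = X
    for W, b in zip(mlp.coefs_[:-1], mlp.intercepts_[:-1]):
        a = np.maximum(a @ W + b, 0.0)   # ReLU hidden units
    return a

h = features(clf, x)   # learned features, one row per time point
```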
Theorem: TCL estimates nonlinear nonstationary ICA
- Assume the data follows the nonlinear ICA model x(t) = f(s(t)) with:
  - independent sources $s_i(t)$ with nonstationary variances, i.e. $s_i(t) \sim \mathcal{N}(0, \sigma_i(\tau)^2)$ in segment $\tau$
  - smooth, invertible nonlinear mixing $f: \mathbb{R}^n \to \mathbb{R}^n$
  - (+ technical assumptions on the non-degeneracy of $\sigma_i(\tau)$)
- Assume we apply time-contrastive learning on x(t), i.e. logistic regression to discriminate between time segments, using an MLP with last-hidden-layer outputs in the vector h(x(t)).
- Then, $s(t)^2 = A\, h(x(t))$ for some linear mixing matrix A (squaring is element-wise).
- I.e.: TCL demixes the nonlinear ICA model up to a linear mixing (which can be estimated by linear ICA) and up to squaring.
- This is a constructive proof of identifiability (up to squaring).
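As a sketch of the post-processing the theorem suggests: since h(x(t)) equals the squared sources only up to the linear mixing A, one further *linear* ICA step can undo it. Here sklearn's FastICA is used, continuing the variables from the sketches above.

```python
from sklearn.decomposition import FastICA

# Theorem: s(t)^2 = A h(x(t)), so ordinary linear ICA on the learned
# features removes the remaining linear mixing A
ica = FastICA(n_components=n_sources, random_state=0)
s2_est = ica.fit_transform(h)   # estimates s_i(t)^2, up to permutation/scale/sign
```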
Illustration and comments
[Figure: (A) generative model, with source signals s_1..s_n over segments 1..T passed through the nonlinear mixture f to give the observed signals; (B) TCL, with the feature extractor giving feature values 1..m and multinomial logistic regression predicting the segment labels (Theorem 1 connects the two)]
- Nonstationarity enables identifiability, since independence of the sources must hold for all time points: enough constraints
- Many data sets are well known to be nonstationary: video, EEG/MEG, financial time series
- We can generalize nonstationarity to the exponential family
- We can combine with dimension reduction: find only the nonstationary manifold
Sketch of proof of Theorem
- Denote: h, hidden unit outputs; x, data; $w_\tau$, logistic regression coefficients in segment $\tau$; $p_\tau$, probability density in segment $\tau$.
- By the theory of logistic regression, we learn differences of the log-pdfs of the classes:
  $$w_\tau^T h(x_t) + b_\tau = \log p_\tau(x_t) - \log p_1(x_t) + \text{const} \qquad (3)$$
- By the nonlinear ICA model, we have
  $$\log p_\tau(x) = \sum_{i=1}^n \lambda_{\tau,i}\, s_i^2 + \log |\det Jg(x)| - \log Z(\lambda_\tau) \qquad (4)$$
  where $g = f^{-1}$ is the inverse of the nonlinear mixing and $Jg$ its Jacobian.
- So the $s_i^2$ and the $h_i(x_t)$ span the same subspace: the $s_i^2$ are linear transformations of the hidden units.
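To fill in the step connecting (4) to the Gaussian segment model, a routine computation not spelled out on the slide: for a zero-mean Gaussian source in segment $\tau$,
$$\log p_\tau(s_i) = -\frac{s_i^2}{2\sigma_i(\tau)^2} - \log\!\big(\sqrt{2\pi}\,\sigma_i(\tau)\big),$$
so with $s = g(x)$, summing over the independent components and adding the change-of-variables term $\log|\det Jg(x)|$ yields (4), with $\lambda_{\tau,i} = -1/(2\sigma_i(\tau)^2)$ and $\log Z(\lambda_\tau)$ collecting the normalization terms.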
Simulations with artificial data
- Create data according to the model, try to recover the sources.
- Nonlinear mixing is by another MLP; segment length 512 points.
[Figure: left panel, mean correlation between true and recovered sources vs. number of segments (8 to 512); right panel, segment-classification accuracy (%) vs. number of segments, with chance levels shown. Methods compared, each with L = 1 to 5 layers: TCL; NSVICA (linear nonstationarity-based method); kTDSEP (Harmeling et al., 2003); DAE (denoising autoencoder).]
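For completeness, one plausible way to compute the "mean correlation" score of the left panel, continuing the sketches above; matching estimated to true components with the Hungarian algorithm is an assumption about the exact metric, not something stated on the slide.

```python
from scipy.optimize import linear_sum_assignment

# Correlations between true squared sources and the TCL+ICA estimates;
# reuses s and s2_est from the sketches above
true_sq = s ** 2
C = np.corrcoef(true_sq.T, s2_est.T)[:n_sources, n_sources:]

# Optimal one-to-one matching of components, maximizing total |correlation|
rows, cols = linear_sum_assignment(-np.abs(C))
print("mean correlation:", np.abs(C[rows, cols]).mean())
```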
Experiments with brain imaging data
- MEG data (like EEG but better)
- Sources estimated from resting data (no stimulation)
- a) Validation by classifying another data set with four stimulation modalities: visual, auditory, tactile, rest
  - trained a linear SVM on the estimated sources
  - number of layers in the MLP ranging from 1 to 4
- b) Attempt to visualize the nonlinear processing
[Figure 3: Real MEG data. a) Classification accuracies (%) of linear SVMs newly trained with task-session data to predict stimulation labels in task sessions, with feature extractors trained in advance on resting-session data; methods TCL, DAE, kTDSEP, NSVICA with L = 1 to 4. Error bars give standard errors of the mean across ten repetitions. b) Visualization of the learned processing across layers L1 to L3.]
Conclusion
- We proposed the intuitive idea of time-contrastive learning:
  - divide a multivariate time series into segments; learn to discriminate them, e.g. by ordinary MLP (deep) learning
  - unsupervised learning via supervised learning
  - no new algorithms or software needed
- TCL can be shown to estimate a nonlinear ICA model:
  - with general (smooth, invertible) nonlinear mixing functions
  - assuming the sources are nonstationary
  - (note: the likelihood or mutual information of the nonlinear ICA model would be much more difficult to compute)
- First case of nonlinear ICA (or source separation) with general identifiability results!! (?)
- Future work: application to image/video data etc.; combining nonstationarity with autocorrelations