Unsupervised Feature Extraction by Time-Contrastive Learning and Nonlinear ICA


Unsupervised Feature Extraction by Time-Contrastive Learning and Nonlinear ICA
Aapo Hyvärinen, with Hiroshi Morioka
Dept. of Computer Science, University of Helsinki, Finland
Facebook AI Summit, 13th June 2016

Abstract: How can we extract features from multi-dimensional data when there are no labels (unsupervised)? We consider data with temporal structure, and learn features that enable discriminating data from different time segments (taking segment labels as class labels). We use ordinary neural networks with multinomial logistic regression: the last hidden layer gives the features. Surprising theoretical result: this learns to estimate a nonlinear ICA model x(t) = f(s(t)) with a general nonlinear mixing f and nonstationary components s_i(t).

Background: Need for generative models like ICA. Unsupervised deep learning is a largely unsolved problem. It is important since labels are often difficult (costly) to obtain. Most approaches are heuristic, and it is not very clear what they are doing. The best approach would be to define a generative model and estimate it. Cf. linear unsupervised learning: independent component analysis (ICA) / sparse coding are generative models which are well-defined, i.e. identifiable (Darmois-Skitovich around 1950; Comon, 1994). If we define and estimate generative models, we know better what we are doing, and we can use all the theory of probabilistic methods... but admittedly, it is theoretically more challenging.

Background: Nonlinear ICA may not be well-defined. For a random vector x, it is easy to assume a nonlinear generative model
x = f(s) (1)
with mutually independent hidden/latent components s_i. However, the model is not identifiable, i.e. many different nonlinear transforms of x give independent components: there is no guarantee we can recover the original s_i, if we assume data with no temporal structure and general smooth invertible nonlinearities f (Darmois, 1952; Hyvärinen and Pajunen, 1999). Nevertheless, estimation has been attempted by many authors, e.g. Tan-Zurada (2001), Almeida (2003), and recent deep learning work (Dinh et al, 2015).
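For concreteness, the kind of construction behind these non-identifiability results can be written out in two dimensions; the following is the standard Darmois/Rosenblatt-type argument (our notation, not from the slides):

```latex
% Darmois/Rosenblatt-type construction (two dimensions): any x can be
% mapped nonlinearly to independent, uniform "components".
\[
  y_1 = F_{x_1}(x_1), \qquad
  y_2 = F_{x_2 \mid x_1}(x_2 \mid x_1),
\]
% where F denotes a (conditional) cumulative distribution function.
% y_1 and y_2 are independent and uniform on [0,1] for ANY distribution
% of x, yet this transform generally has nothing to do with inverting
% the true mixing f, so independence alone cannot identify the sources.
```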

Background: Temporal correlations can help. Harmeling et al (2003) suggested using temporal structure: find features that change as slowly as possible (Földiák, 1991); they used kernel-based models of the nonlinearities. This is a well-known idea in the linear ICA (source separation) literature (Tong et al 1991; Belouchrani, 1997). In the linear case, the model is identifiable if the autocorrelations are distinct for the different sources (a rather strict condition!). In the nonlinear case, identifiability is unknown, but certainly not better than in the linear case!

Background: Temporal structure as nonstationarity. A less-known principle in linear source separation: the sources are nonstationary (Matsuoka et al, 1995). Usually, we assume the variances of the sources change in time:
s_i(t) ~ N(0, σ_i(t)²) (2)
The linear model x(t) = A s(t) is identifiable under weak assumptions (Pham and Cardoso, 2001). So far, this principle has not been used in the nonlinear case... (A small simulation of this model is sketched below.)
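A minimal sketch of this generative model (our own illustration; the dimensions, segment length and variance ranges are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

n, n_segments, seg_len = 3, 64, 200          # arbitrary illustration sizes
T = n_segments * seg_len

# Nonstationary variances: a different sigma_i for each source in each segment
sigmas = rng.uniform(0.2, 2.0, size=(n_segments, n))       # sigma_i(tau)
s = rng.normal(size=(T, n)) * np.repeat(sigmas, seg_len, axis=0)

# Linear mixing x(t) = A s(t); the nonlinear case replaces A by a smooth f
A = rng.normal(size=(n, n))
x = s @ A.T
```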

Time-contrastive learning (TCL): Intuitive motivation. Assume we are given an n-dimensional time series x(t), with t the time index. Divide the time series (arbitrarily) into k segments (e.g. bins of equal size, with 100-1000 points in each segment). Train a multi-layer perceptron to discriminate between the segments: the number of classes is k, and the index of the segment is the class label. Use multinomial regression; well-known algorithms/software. The classifier should find a good representation in its hidden layers, in particular regarding the nonstationarity. This turns unsupervised learning into supervised learning, cf. noise-contrastive estimation or generative adversarial nets. (See the sketch below.)
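A minimal sketch of this recipe using scikit-learn; the hypothetical helper tcl_features, the network size and the number of segments are our illustrative choices, not the exact architecture used in the paper:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def tcl_features(x, n_segments=64, hidden=(32, 32), max_iter=500, seed=0):
    """Time-contrastive learning sketch: train an MLP to discriminate
    time segments, then return its last-hidden-layer activations."""
    seg_len = len(x) // n_segments
    x = x[:seg_len * n_segments]
    labels = np.repeat(np.arange(n_segments), seg_len)   # segment index = class label

    clf = MLPClassifier(hidden_layer_sizes=hidden, activation="relu",
                        max_iter=max_iter, random_state=seed)
    clf.fit(x, labels)                                    # multinomial logistic regression on top

    # Forward pass up to the last hidden layer
    # (scikit-learn does not expose hidden activations directly).
    h = x
    for W, b in zip(clf.coefs_[:-1], clf.intercepts_[:-1]):
        h = np.maximum(h @ W + b, 0.0)                    # ReLU hidden activations
    return h                                              # TCL features h(x(t))
```

With the data-generation sketch above, h = tcl_features(x) gives the learned features used in the following slides.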

Theorem: TCL estimates nonlinear nonstationary ICA. Assume the data follows the nonlinear ICA model x(t) = f(s(t)) with: independent sources s_i(t) with nonstationary variances, i.e. s_i(t) ~ N(0, σ_i(τ)²) in segment τ; a smooth, invertible nonlinear mixing f: R^n → R^n; (+ technical assumptions on the non-degeneracy of the σ_i(τ)). Assume we apply time-contrastive learning on x(t), i.e. logistic regression to discriminate between time segments, using an MLP whose last-hidden-layer outputs form the vector h(x(t)). Then, s(t)² = A h(x(t)) for some linear mixing matrix A (squaring is element-wise). I.e.: TCL demixes the nonlinear ICA model up to a linear mixing (which can be estimated by linear ICA) and up to squaring. This is a constructive proof of identifiability (up to squaring).
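The final linear step can be sketched as follows, assuming the hypothetical tcl_features helper and the simulated x and s from the earlier sketches:

```python
from sklearn.decomposition import FastICA

# TCL features from the earlier sketch (x, s from the data-generation sketch)
h = tcl_features(x, n_segments=64)

# By the theorem, s(t)^2 = A h(x(t)): the squared sources are a *linear*
# mixture of the learned features, so a final linear ICA recovers them
# up to permutation, sign and scaling.
ica = FastICA(n_components=s.shape[1], random_state=0)
s2_est = ica.fit_transform(h)
```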

Illustration and comments.
[Figure: (A) the generative model: source signals s_1...s_n with variances changing across segments 1, 2, 3, ..., T are passed through the nonlinear mixture f to give the observed signals; (B) time-contrastive learning: a feature extractor followed by multinomial logistic regression predicting the segment labels, with Theorem 1 connecting the feature values to the sources.]
Nonstationarity enables identifiability, since independence of the sources must hold for all time points: enough constraints. Many data sets are well known to be nonstationary: video, EEG/MEG, financial time series. We can generalize nonstationarity to the exponential family. We can combine with dimension reduction: find only the nonstationary manifold.

Sketch of proof of Theorem. Denote by h the hidden unit outputs; x the data; w_τ the logistic-regression coefficients for segment τ; p_τ the probability density in segment τ. By the theory of logistic regression, we learn the differences of the log-pdfs of the classes:
w_τ^T h(x_t) + b_τ = log p_τ(x_t) − log p_1(x_t) + const. (3)
By the nonlinear ICA model, we have
log p_τ(x) = Σ_{i=1}^n λ_{τ,i} s_i² + log |det Jg(x)| − log Z(λ_τ), (4)
where Jg is the Jacobian of g = f⁻¹, the inverse of the nonlinear mixing, and s = g(x). In the difference (3) the Jacobian term cancels, so the s_i² and the h_i(x_t) span the same subspace: the s_i² are linear transformations of the hidden units.
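One way to spell out that last step (our rewriting of the slide's argument): subtracting the τ = 1 case of (4) and equating with (3) gives

```latex
% Combining (3) and (4): the Jacobian term log|det Jg(x)| is the same for
% every segment and cancels, and the log Z terms are constants, so for
% every segment tau and every time point t
\[
  \mathbf{w}_\tau^{\top} h(x_t) + b_\tau
  = \sum_{i=1}^{n} \left(\lambda_{\tau,i} - \lambda_{1,i}\right) s_i(t)^2 + c_\tau .
\]
% If enough segments have linearly independent coefficient vectors
% (lambda_{tau,1}, ..., lambda_{tau,n}) -- the non-degeneracy assumption --
% this forces h(x_t) and (s_1(t)^2, ..., s_n(t)^2) to span the same
% subspace, i.e. the squared sources are linear functions of the hidden units.
```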

Simulations with artificial data. Create data according to the model and try to recover the sources. The nonlinear mixing is by another MLP; segment length 512 points.
[Figure: (left) mean correlation between true and recovered sources as a function of the number of segments (8 to 512), comparing TCL, NSVICA, kTDSEP and DAE with MLP depths L = 1 to 5; (right) classification accuracy (%) of the TCL classifier as a function of the number of segments, for L = 1 to 5, together with the corresponding chance levels.]
kTDSEP: Harmeling et al (2003). DAE: denoising autoencoder. NSVICA: linear nonstationarity-based method.
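For completeness, a minimal sketch of the kind of evaluation shown in the left panel, assuming s and s2_est from the earlier sketches; the Hungarian matching of components is our simplification:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mean_abs_corr(s_true, s_est):
    """Mean absolute correlation between true and estimated components,
    with components paired one-to-one by the Hungarian algorithm."""
    n = s_true.shape[1]
    C = np.abs(np.corrcoef(s_true.T, s_est.T)[:n, n:])   # n x n |correlation| block
    row, col = linear_sum_assignment(-C)                  # best one-to-one matching
    return C[row, col].mean()

score = mean_abs_corr(s**2, s2_est)   # compare squared sources, per the theorem
```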

Experiments with brain imaging data. MEG data (like EEG but better); sources estimated from resting-state data (no stimulation). a) Validation by classifying another data set with four stimulation modalities: visual, auditory, tactile, rest. We trained a linear SVM on the estimated sources; the number of layers in the MLP ranged from 1 to 4. b) Attempt to visualize the nonlinear processing.
[Figure 3: Real MEG data. a) Classification accuracies of linear SVMs newly trained with task-session data to predict stimulation labels in task sessions, with feature extractors trained in advance on resting-session data; TCL compared with DAE, kTDSEP and NSVICA for L = 1 and L = 4. Error bars give standard errors of the mean across ten repetitions. b) Visualization of the learned features at layers L1-L3.]

Conclusion. We proposed the intuitive idea of time-contrastive learning: divide a multivariate time series into segments and learn to discriminate between them, e.g. by ordinary MLP (deep) learning. This is unsupervised learning via supervised learning; no new algorithms or software are needed. TCL can be shown to estimate a nonlinear ICA model, with general (smooth, invertible) nonlinear mixing functions, assuming the sources are nonstationary. (Note: the likelihood or mutual information of the nonlinear ICA model would be much more difficult to compute.) First case of nonlinear ICA (or source separation) with general identifiability results!! (?) Future work: application to image/video data etc.; combining nonstationarity with autocorrelations.