Part 2: Multivariate fMRI analysis using a sparsifying spatio-temporal prior


Chalmers Machine Learning Summer School: Approximate Message Passing and Biomedicine
Part 2: Multivariate fMRI analysis using a sparsifying spatio-temporal prior
Tom Heskes, joint work with Marcel van Gerven and Botond Cseke
Machine Learning Group, Institute for Computing and Information Sciences, Radboud University Nijmegen, The Netherlands
April 15, 2015

Outline
  Bayesian Linear Models
    Large p, small N
    Lasso vs. ridge regression
    Bayesian interpretation
  Multivariate sparsifying priors
    Motivation: brain reading
    Scale mixture models
    Multivariate extensions
    Approximate inference
  Experiments
    fMRI classification
    MEG source localization
  Conclusions

Large p, small N
Many datasets grow wide, with many more features than samples. Neuroimaging: p = 20K pixels, N = 100 experiments/subjects. Document classification: p = 20K features (bag of words), N = 5K documents. Micro-array studies: p = 40K genes measured for N = 100 subjects. In all of these we use linear models (e.g., linear regression, logistic regression) that we have to regularize when $p \gg N$.

The Lasso
Given observations $\{y_i, x_{i1}, \ldots, x_{ip}\}_{i=1}^N$, minimize
$$E(\theta) = \sum_{i=1}^N \Big( y_i - \theta_0 - \sum_{j=1}^p x_{ij}\theta_j \Big)^2 + \gamma \sum_j |\theta_j| \, .$$
Similar to ridge regression, which instead uses the penalty $\sum_j \theta_j^2$. The lasso leads to sparse solutions (many $\theta_j$ set exactly to zero), whereas ridge regression only shrinks.
[Figure: constraint regions for the lasso and for ridge regression.]
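
As a side illustration (not part of the slides), here is a minimal scikit-learn sketch contrasting the two penalties in a $p \gg N$ setting; the synthetic data and the penalty strength alpha are assumptions chosen only to make the sparsity difference visible.

```python
# Minimal sketch (assumed synthetic data): lasso vs. ridge when p >> N.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
N, p = 100, 2000                      # many more features than samples
X = rng.standard_normal((N, p))
theta_true = np.zeros(p)
theta_true[:10] = 3.0                 # only 10 truly relevant features
y = X @ theta_true + 0.5 * rng.standard_normal(N)

lasso = Lasso(alpha=0.1).fit(X, y)    # L1 penalty: gamma * sum_j |theta_j|
ridge = Ridge(alpha=0.1).fit(X, y)    # L2 penalty: gamma * sum_j theta_j^2

print("nonzero lasso coefficients:", int(np.sum(lasso.coef_ != 0)))
print("nonzero ridge coefficients:", int(np.sum(ridge.coef_ != 0)))  # ridge only shrinks
```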

Probabilistic Interpretation
Likelihood:
$$p(y \mid \theta, X) = \prod_{i=1}^N \mathcal{N}(y_i; \theta^T x_i, \beta^{-1}).$$
Prior: for the lasso, a Laplace prior
$$p(\theta) = \prod_{j=1}^p \mathcal{L}(\theta_j; 0, \lambda),$$
whereas for ridge regression, a Gaussian prior
$$p(\theta) = \prod_{j=1}^p \mathcal{N}(\theta_j; 0, \lambda^2).$$
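
To make the correspondence explicit (a standard derivation, not spelled out on the slide, and assuming the Laplace density $\mathcal{L}(\theta; 0, \lambda) \propto \exp(-|\theta|/\lambda)$), taking the negative logarithm of likelihood times prior gives, up to constants and absorbing the intercept $\theta_0$,
$$-\log\big[p(y \mid \theta, X)\, p(\theta)\big] = \frac{\beta}{2} \sum_{i=1}^N \big( y_i - \theta^T x_i \big)^2 + \frac{1}{\lambda} \sum_{j=1}^p |\theta_j| + \mathrm{const},$$
which is the lasso objective with $\gamma = 2/(\beta\lambda)$ after rescaling; with the Gaussian prior the penalty instead becomes $\frac{1}{2\lambda^2}\sum_j \theta_j^2$, i.e., ridge regression.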

Posterior Distribution
Bayes' rule yields the posterior distribution
$$p(\theta \mid Y, X) \propto p(Y \mid \theta, X)\, p(\theta).$$
The solution of lasso/ridge regression corresponds to the maximum a posteriori (MAP) solution. The Bayesian framework provides a principled approach for computing error bars, optimizing hyperparameters, incorporating prior knowledge, experiment selection, ...
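
For the Gaussian-prior (ridge) case this posterior is available in closed form; below is a minimal numpy sketch of the posterior mean and covariance that give point predictions and error bars. The synthetic data and hyperparameter values are assumptions for illustration.

```python
# Minimal sketch: closed-form posterior for Bayesian linear regression with a
# Gaussian prior N(0, lambda^2 I) on theta and noise precision beta (ridge case).
import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 10
X = rng.standard_normal((N, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(N)

beta, lam = 100.0, 1.0                       # assumed hyperparameters
A = np.eye(p) / lam**2 + beta * X.T @ X      # posterior precision
cov = np.linalg.inv(A)                       # posterior covariance -> error bars
mean = beta * cov @ X.T @ y                  # posterior mean (= ridge / MAP solution)

x_new = rng.standard_normal(p)
pred_mean = x_new @ mean                     # mean prediction
pred_var = 1.0 / beta + x_new @ cov @ x_new  # predictive variance
print(pred_mean, pred_var)
```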

Brain Reading
Deduce a person's intentions by reading their brain. Brain-computer interfaces based on on-line EEG analysis. Classification of brain activity measured through fMRI into prespecified categories. Assumption: if a particular voxel/electrode/frequency is relevant, then we expect its neighbors to be relevant as well. How do we incorporate such knowledge into our priors?

Scale Mixture Models
The Laplace distribution (like many others) can be written as a scale mixture of Gaussians:
$$\mathcal{L}(\theta; 0, \lambda) = \int_0^\infty d\sigma^2\, \mathcal{E}(\sigma^2; 2\lambda)\, \mathcal{N}(\theta; 0, \sigma^2),$$
with $\mathcal{E}$ an exponential distribution. We interpret the scales $\sigma_j$ as the relevance or importance of feature $j$: higher $\sigma_j$ implies more important. By coupling the scales, we will couple the relevances.
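
The representation is easy to check numerically; the sketch below assumes the parameterization in which $\sigma^2$ is exponential with mean $2b^2$ for a Laplace with scale $b$ (the slide's exact convention for $\mathcal{E}(\sigma^2; 2\lambda)$ may differ).

```python
# Monte Carlo check of the Gaussian scale-mixture representation of the Laplace.
# Assumed parameterization: sigma^2 ~ Exponential(mean = 2*b**2) gives theta ~ Laplace(0, b).
import numpy as np

rng = np.random.default_rng(2)
b = 1.5                                        # assumed Laplace scale
sigma2 = rng.exponential(scale=2 * b**2, size=1_000_000)
theta = rng.normal(0.0, np.sqrt(sigma2))       # draw theta | sigma^2 ~ N(0, sigma^2)

direct = rng.laplace(0.0, b, size=1_000_000)   # direct Laplace samples for comparison
print(np.std(theta), np.std(direct))           # both close to sqrt(2)*b ~ 2.12
```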

Multivariate Extension
The exponential distribution is equivalent to a chi-square distribution in two dimensions: if $u, v \sim \mathcal{N}(0, \lambda)$ then $\sigma^2 = u^2 + v^2 \sim \mathcal{E}(2\lambda)$, and thus
$$\mathcal{L}(\theta; 0, \lambda) = \int du\, \mathcal{N}(u; 0, \lambda) \int dv\, \mathcal{N}(v; 0, \lambda)\, \mathcal{N}(\theta; 0, u^2 + v^2).$$
We define a multivariate Laplace distribution by coupling the $u$'s and the $v$'s:
$$\mathcal{L}(\theta; 0, \Lambda) = \int du\, \mathcal{N}(u; 0, \Lambda) \int dv\, \mathcal{N}(v; 0, \Lambda) \prod_{j=1}^p \mathcal{N}(\theta_j; 0, u_j^2 + v_j^2),$$
with $\Lambda$ a covariance matrix.
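
A small sketch of how one could sample from such a multivariate Laplace by coupling the auxiliary Gaussian vectors; the nearest-neighbor covariance $\Lambda$ used here is an arbitrary example of spatial coupling, not the prior used in the paper.

```python
# Sketch: sampling theta from a multivariate Laplace built by coupling the
# auxiliary Gaussian vectors u and v through a shared covariance Lambda.
import numpy as np

rng = np.random.default_rng(3)
p = 50
idx = np.arange(p)
Lambda = 0.9 ** np.abs(idx[:, None] - idx[None, :])   # example "spatial" coupling

u = rng.multivariate_normal(np.zeros(p), Lambda, size=10_000)
v = rng.multivariate_normal(np.zeros(p), Lambda, size=10_000)
scale2 = u**2 + v**2                                   # per-coordinate variances
theta = rng.normal(0.0, np.sqrt(scale2))               # theta_j ~ N(0, u_j^2 + v_j^2)

# Marginally each theta_j is Laplace; neighboring relevances (scales) are correlated.
print(np.corrcoef(scale2[:, 0], scale2[:, 1])[0, 1])
```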

The Joint Posterior
$$p(\theta, u, v \mid X, Y) \propto \underbrace{\prod_{i=1}^N \mathcal{N}(y_i; x_i^T\theta, \beta^{-1})}_{\text{likelihood (Gaussian)}} \; \underbrace{\prod_{j=1}^p \mathcal{N}(\theta_j; 0, u_j^2 + v_j^2)}_{\text{couplings between coefficients and scales}} \; \underbrace{\mathcal{N}(u; 0, \Lambda)\, \mathcal{N}(v; 0, \Lambda)}_{\text{multivariate Gaussians}}.$$
Interested in: the mean of $\theta$ for mean predictions; the covariance of $\theta$ for error bars; the variances of $u$ and $v$ for relevance. By symmetry: the means of $u$ and $v$ are zero; the covariances of $u$ and $v$ are the same; $u$ and $v$ are uncorrelated; $\{u, v\}$ and $\theta$ are uncorrelated.
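
As an illustration of this structure (not code from the paper), an unnormalized log-density of the joint posterior could be written as follows; the function and variable names are hypothetical.

```python
# Sketch: unnormalized log of the joint posterior p(theta, u, v | X, Y).
import numpy as np
from scipy.stats import multivariate_normal, norm

def log_joint(theta, u, v, X, y, beta, Lambda):
    # Gaussian likelihood terms N(y_i; x_i^T theta, 1/beta)
    ll = norm.logpdf(y, loc=X @ theta, scale=np.sqrt(1.0 / beta)).sum()
    # couplings N(theta_j; 0, u_j^2 + v_j^2) between coefficients and scales
    coupling = norm.logpdf(theta, loc=0.0, scale=np.sqrt(u**2 + v**2)).sum()
    # multivariate Gaussian priors on the scale vectors u and v
    prior = (multivariate_normal.logpdf(u, mean=np.zeros(len(u)), cov=Lambda)
             + multivariate_normal.logpdf(v, mean=np.zeros(len(v)), cov=Lambda))
    return ll + coupling + prior
```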

Factor Graph
[Figure: factor graph with factors $f_1, \ldots, f_N$ attached to auxiliary variables $z_1, \ldots, z_N$, factors $g_1, \ldots, g_N$ connecting them to the coefficients $\theta_1, \ldots, \theta_p$, factors $h_1, \ldots, h_p$ connecting each $\theta_j$ to its scales $u_j, v_j$, and factors $i_1, i_2$ at the bottom.]
$f_i$: the likelihood term corresponding to data point $i$.
$g_i$: implements the linear constraint $z_i = \theta^T x_i$.
$h_j$: corresponds to the coupling between regression coefficient $j$ and its scales.
$i_1, i_2$: represent the multivariate Gaussians on $\{u, v\}$.

Approximate Inference
The joint posterior is essentially a (possibly huge) Gaussian random field with some (low-dimensional) nonlinear interaction terms. Exact inference is intractable, even with independent scales. Method of choice: expectation propagation. Approximate the exact joint posterior by a multivariate Gaussian.
[Figure: the factor graph from the previous slide.]

Expectation Propagation
[Figure: substitute and project steps on the factor graph, replacing each non-Gaussian factor by a Gaussian one.]
Iteratively approximate non-Gaussian terms by Gaussian terms. Each approximation boils down to (low-dimensional) moment matching. Some clever sparse-matrix tricks make this computationally doable. Main operation: the Takahashi procedure for computing the diagonal elements of the inverse of a sparse matrix. Cseke and Heskes: Journal of Machine Learning Research, 2011.
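
The "project" (moment-matching) step can be illustrated in one dimension: approximate a non-Gaussian tilted distribution by the Gaussian with the same first two moments. The cavity and factor below are toy choices, not the actual terms of the model.

```python
# Toy illustration of EP's projection step: match the first two moments of a
# non-Gaussian tilted distribution with a Gaussian (computed here by quadrature).
import numpy as np

x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]
cavity = np.exp(-0.5 * x**2)               # Gaussian cavity distribution N(0, 1)
factor = np.exp(-np.abs(x - 1.0))          # a non-Gaussian (Laplace-like) factor
tilted = cavity * factor
tilted /= tilted.sum() * dx                # normalize numerically

mean = (x * tilted).sum() * dx             # first moment
var = ((x - mean)**2 * tilted).sum() * dx  # second central moment
print(f"moment-matched Gaussian approximation: N({mean:.3f}, {var:.3f})")
```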

Possible Extensions
Logistic regression instead of linear regression: the likelihood terms have to be approximated as well; otherwise no essential difference.
Spike-and-slab prior instead of Laplace prior: a scale mixture with two scales; couplings between discrete latent variables or squashed Gaussian variables; similar ideas in Hernández-Lobato, Hernández-Lobato & Dupont, JMLR 2013.
[Figure: the factor graph from the previous slides.]

fMRI Classification
fMRI activations for different handwritten digits. 50 "six" trials and 50 "nine" trials, i.e., N = 100. Full dataset: 5228 voxels measured over 7 consecutive 2500 ms time steps, i.e., p ≈ 35,000.
[Figures: hemodynamic (BOLD) response; typical data samples.]
Van Gerven, Cseke, de Lange, and Heskes: NeuroImage, 2010.

Spatial Importance Maps Data averaged 10 to 15 s after stimulus onset. Decoupled (top) versus spatial coupling of scale variables (bottom). Most relevant voxels in the occipital lobe (Brodmann Areas 17 and 18).

Importance over Time Importance shown for 10 most relevant voxels. Decoupled (left) versus temporal coupling of scale variables (right). Increasing importance corresponds with lagged BOLD response.

Intermezzo: Decoding fMRI Using DBMs
[Figure: unsupervised learning phase, supervised learning phase, reconstruction.]
Van Gerven, de Lange, and Heskes: Neural Computation, 2010.

Reconstructions
[Figure: reconstructions of the stimulus at the pixel level and with one, two, and three layers.]

Source Localization
Sensor readings $Y$ are related to source currents $S$ through
$$Y = XS + \text{noise},$$
with $X$ a known lead field matrix, corresponding to the forward model derived from a structural MRI.
[Figure: head model and source space.]
Essentially an (ill-posed) linear regression problem in which $S$ plays the role of the regression coefficients $\theta$. Different regression problems for different time points. N = 275, the number of sensors. p ≈ 1K (depending somewhat on the discretization), the number of sources.
Van Gerven, Cseke, Oostenveld, and Heskes: NIPS, 2009.
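
For context, a generic ridge-regularized (minimum-norm-style) source estimate for a single time point could be computed as below; this is a baseline sketch, not the sparsifying multivariate Laplace approach of the talk, and the lead field, data, and regularization value are stand-ins.

```python
# Generic ridge-regularized (minimum-norm-style) source estimate for one time point.
# NOT the multivariate Laplace method of the talk; all inputs are stand-ins.
import numpy as np

rng = np.random.default_rng(4)
n_sensors, n_sources = 275, 1000
X = rng.standard_normal((n_sensors, n_sources))   # stand-in lead field matrix
y = rng.standard_normal(n_sensors)                # sensor readings at one time point

lam = 1.0                                         # assumed regularization strength
# S_hat = X^T (X X^T + lam I)^{-1} y : kernel form of ridge regression, convenient
# since the number of sensors is much smaller than the number of sources.
S_hat = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n_sensors), y)
print(S_hat.shape)
```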

Without Constraints: not so clear.

With Spatial Constraints: sources where you'd expect to see them.

Conclusions and Discussion
Take-home message: a novel way to specify multivariate sparsifying priors through scale-mixture representations. Posterior estimates of these scales can be used for relevance determination. Efficient techniques for approximate inference. Increased interpretability when analyzing neuroimaging data.
Future directions: extensions to (correlated) spike-and-slab priors; improved stability of expectation propagation.