Robust regression and non-linear kernel methods for characterization of neuronal response functions from limited data


Robust regression and non-linear kernel methods for characterization of neuronal response functions from limited data
Maneesh Sahani, Gatsby Computational Neuroscience Unit, University College London
Jennifer Linden, Keck Center for Integrative Neurosciences, University of California at San Francisco

Studying perceptual systems
A stimulus $x(t)$ evokes a neural response $y(t)$.
Decoding: $\hat{x}(t) = G[y(t)]$ (reconstruction)
Encoding: $\hat{y}(t) = F[x(t)]$ (system identification)

Spike-triggered average
Decoding view: the mean of $P(x \mid y = 1)$, i.e. the average stimulus preceding a spike.
Encoding view: a predictive filter.
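
The decoding view suggests a direct estimator. A minimal sketch in Python/NumPy (illustrative, not the authors' code), assuming a discretized stimulus x and a binary spike train y of equal length, with `lags` the window length in samples:

    import numpy as np

    def spike_triggered_average(x, y, lags):
        """Mean stimulus segment in the `lags` bins preceding each spike."""
        windows = [x[t - lags:t] for t in range(lags, len(y)) if y[t] > 0]
        return np.mean(windows, axis=0)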

Wiener filtering is linear regression
$y(t) = \int_{0}^{T} d\tau\; w(\tau)\, x(t - \tau)$
Discretizing, each row of the design matrix holds the lagged stimulus values, so the model becomes
$X W = Y$, with row $t$ of $X$ equal to $[\,x_t,\ x_{t-1},\ \dots,\ x_{t-T+1}\,]$ and $Y = [\,y_T,\ y_{T+1},\ \dots\,]^{\top}$.
In the frequency domain the Wiener solution is $W(\omega) = \dfrac{X(\omega)^{*}\, Y(\omega)}{|X(\omega)|^{2}}$; in matrix form,
$W = \underbrace{(X^{\top} X)^{-1}}_{\Sigma_{SS}^{-1}}\ \underbrace{(X^{\top} Y)}_{\mathrm{STA}}$
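
The same solution in code. A minimal sketch (function names are illustrative, not from the talk) that builds the lagged design matrix and solves the normal equations:

    import numpy as np

    def lagged_design(x, lags):
        """Rows are [x_t, x_{t-1}, ..., x_{t-lags+1}] for t = lags-1 ... T-1."""
        return np.stack([x[t - lags + 1:t + 1][::-1]
                         for t in range(lags - 1, len(x))])

    def fit_wiener(x, y, lags):
        X = lagged_design(x, lags)
        Y = y[lags - 1:]
        # W = (X'X)^{-1} X'Y: stimulus autocorrelation^{-1} times the STA
        return np.linalg.solve(X.T @ X, X.T @ Y)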

Overfitting
Maximum-likelihood estimates often overfit to noise in the training data. Overfitting is a fundamental problem in data modelling; it can never be completely overcome, and even the correct model with the correct priors will overfit. One common signature of overfitting in (quasi-)linear models is the appearance of large weights of opposite signs: their difference has been tuned by likelihood optimization to model the training data. Such solutions are also often unstable: change the data a little and the weights fluctuate alarmingly. Overfitting of this kind is often combatted by penalizing large weights. This penalty is called weight decay in the neural-network literature and ridge regression in the statistics literature, and is equivalent to a prior distribution centered on zero weight in the Bayesian literature.
[Figure: a 1-D fit and an N-D fit with spurious (random) inputs, compared with the data and the true relationship.]
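
In code, the ridge/weight-decay estimate only adds a diagonal term to the normal equations. A minimal sketch, in which the regularization constant `lam` (corresponding to the zero-mean Gaussian prior) is an assumed free parameter:

    import numpy as np

    def fit_ridge(X, Y, lam=1.0):
        # Penalizing large weights = MAP estimate under a zero-mean Gaussian prior
        D = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)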

ARD
Maximum-likelihood regression can be improved by a Bayesian analysis called automatic relevance determination (ARD), due to MacKay and Neal. Assume
$w_i \sim \mathcal{N}(0,\ 1/\alpha_i)$ independently,
and define the marginal likelihood
$L_2(\{\alpha_i\}) = \int dw_1 \cdots dw_D\ \underbrace{P(Y \mid X;\, W)}_{L(W)}\ P(W \mid \{\alpha_i\})$
Optimize $L_2$ with respect to $\{\alpha_i\}$ (ML-2). Then
$\alpha_i \to \infty \ \Rightarrow\ w_i = 0$: the input is irrelevant;
$\alpha_i$ finite $\ \Rightarrow\ w_i = \arg\max P(w_i \mid X, Y, \alpha_i)$: the input is relevant.
[Figure: the N-D fit with ARD recovers the true relationship from the data.]
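
A minimal sketch of ARD by evidence optimization for Gaussian-noise linear regression, using MacKay-style fixed-point updates for the per-weight precisions alpha_i and the noise precision beta (an illustrative implementation, not the authors' code):

    import numpy as np

    def fit_ard(X, Y, n_iter=100, alpha_max=1e8):
        N, D = X.shape
        alpha = np.ones(D)            # per-weight prior precisions
        beta = 1.0 / np.var(Y)        # noise precision
        for _ in range(n_iter):
            # Posterior over weights given the current hyperparameters
            S = np.linalg.inv(beta * X.T @ X + np.diag(alpha))
            m = beta * S @ X.T @ Y
            # MacKay fixed-point updates for alpha_i and beta
            gamma = 1.0 - alpha * np.diag(S)
            alpha = np.minimum(gamma / (m ** 2 + 1e-12), alpha_max)
            resid = Y - X @ m
            beta = (N - gamma.sum()) / (resid @ resid + 1e-12)
        # Weights whose alpha_i saturates at alpha_max are effectively pruned
        return m, alpha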

Simulated simple cell
[Figure: the actual filter compared with estimates obtained by maximum likelihood (ML), ridge regression (MAP), and automatic relevance determination (ARD), together with the normalized predictive power and excess predicted power of each method.]

Rat auditory cortex cell
[Figure: spectrotemporal filter estimates (frequency in kHz versus time in ms) obtained by maximum likelihood and by ARD for a rat auditory cortex cell, with the normalized predictive power of each and the average power of the response. Recording R216182G/21911/pen12loc2poisshical8a.]

Population predictive performance
Tested on new responses to the same stimuli.
[Figure: normalized ARD predictive power versus normalized ML predictive power for each recording, and the distribution of the predictive power difference (ARD − ML) as a function of normalized noise power.]
Parameters can still overfit to the particular stimulus used.

Population predictive performance
Tested on responses to new stimuli.
[Figure: normalized ARD predictive power versus normalized ML predictive power for each recording, and the distribution of the predictive power difference (ARD − ML) as a function of normalized noise power, for responses to novel stimuli.]

Evaluating constrained linear models by prediction

Separating signal power from noise
$y(t) = \underbrace{\mu(t)}_{\text{signal}} + \underbrace{\eta(t)}_{\text{noise}}$
Taking powers (variance about the mean over time):
$P\bigl(y(t)\bigr) = \bigl\langle\, (y(t) - \langle y \rangle)^{2} \,\bigr\rangle$
The maximum possible reduction in variance is given by the power of the stimulus-locked signal. Repeated trials that use the same stimulus allow us to separate signal from noise:
$\underbrace{P(y)}_{\text{observed power}} = \underbrace{P(\mu)}_{\text{signal power}} + \underbrace{\sigma^{2}}_{\text{average noise variance}}$
Averaging the responses over $N$ trials gives $P(\bar y) = P(\mu) + \sigma^{2}/N$. Thus
$P(\mu) = \dfrac{N\, P(\bar y) - \overline{P(y)}}{N - 1}$
which is an unbiased estimator. It also has a computable variance, provided the noise distribution is not pathological:
$\operatorname{Var}\bigl[\hat P(\mu)\bigr] = \dfrac{4}{N}\Bigl(\dfrac{1}{T^{2}}\,\mu^{\top}\Sigma\,\mu - \dfrac{2}{T}\,\mu^{\top}\sigma\,\bar\mu + \bar\mu\,\bar\sigma\,\bar\mu\Bigr) + \dfrac{2}{N(N-1)}\Bigl(\dfrac{1}{T^{2}}\operatorname{Tr}[\Sigma\Sigma] - \dfrac{2}{T}\,\sigma^{\top}\sigma + \bar\sigma^{2}\Bigr)$
where $\Sigma$, $\sigma$ and $\bar\sigma$ are statistics describing the (co)variance of the noise and must themselves be estimated from data. This is an estimated bound, but it will not be achievable in practice, because any model (even the right one!) will overfit given limited data.
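
The estimator itself is easy to compute from a trial-by-time response matrix. A minimal sketch (illustrative, not the authors' code):

    import numpy as np

    def signal_power(Y):
        """Y: (N trials x T time bins) responses to repeats of one stimulus."""
        N, T = Y.shape
        def power(f):
            return np.mean((f - f.mean()) ** 2)
        P_avg = np.mean([power(y) for y in Y])   # mean single-trial power, P(y)
        P_of_mean = power(Y.mean(axis=0))        # power of the mean response, P(ybar)
        # Unbiased estimate of the stimulus-locked signal power P(mu)
        return (N * P_of_mean - P_avg) / (N - 1)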

Non-parametric predictability estimates
Non-parametric prediction can provide an estimated lower bound on the predictable power. The cell response function is not modelled explicitly; instead, responses are predicted based on observations made on similar stimuli. The simplest approach is to average the training data over trials. However, this will overfit to the noise severely, and it does not allow prediction on new stimuli. An alternative is to smooth the average with reference to the stimulus. This is sometimes known as locally-weighted regression or Parzen-window regression.
[Figure: predictive power of the linear (ML), linear (ARD), trial-average, and Parzen-window predictors for two example recordings (R211182G/21731/pen14loc2poisshical2 and R217251G/21922/pen7loc2poisshical3).]
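
A minimal Parzen-window (Nadaraya-Watson) regression sketch, in which a test response is a kernel-weighted average of training responses, with similarity measured in stimulus space; the Gaussian bandwidth `h` is an assumed free parameter:

    import numpy as np

    def parzen_predict(X_train, y_train, X_test, h=1.0):
        # Squared distances between every test and training stimulus vector
        d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
        W = np.exp(-d2 / (2 * h ** 2))
        # Kernel-weighted average of the observed responses
        return (W @ y_train) / W.sum(axis=1)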

Nonlinearities: the Volterra series
The usual extension to non-linear functionals involves a power series in $x(t)$:
$\hat y(t) = G[x] = g_0 + \int d\tau_1\, g_1(\tau_1)\, x(t-\tau_1) + \iint d\tau_1 d\tau_2\, g_2(\tau_1, \tau_2)\, x(t-\tau_1)\, x(t-\tau_2) + \cdots$
In discrete time, with $x_t$ the vector of lagged stimulus values,
$G(x_t) = g_0 + x_t^{\top} g_1 + x_t^{\top} g_2\, x_t + \cdots = g_0 + x_t^{\top} g_1 + (x_t \otimes x_t)^{\top} g_2 + \cdots = \underbrace{\bigl[\,1,\ x_t^{\top},\ (x_t \otimes x_t)^{\top},\ \dots\,\bigr]}_{\phi_V(x_t)^{\top}} \begin{bmatrix} g_0 \\ g_1 \\ g_2 \\ \vdots \end{bmatrix} = \phi_V(x_t)^{\top} w = y_t$
so that $\Phi_V(X)\, W = Y$.

Thus, higher order terms can be estimated by linear regression in an augmented space.
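
For the second-order case the augmented design matrix can be built explicitly when the number of lags is modest. A minimal sketch (illustrative names) that produces the features $[1,\ x,\ (x x^{\top})(:)]$ for each row; the ridge or ARD estimators sketched above then apply unchanged:

    import numpy as np

    def volterra_features(X):
        """X is (samples x lags); returns the augmented design matrix Phi_V(X)."""
        quad = np.einsum('ti,tj->tij', X, X).reshape(len(X), -1)  # (x x^T)(:) per row
        return np.hstack([np.ones((len(X), 1)), X, quad])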

Simulated complex cell
[Figure: the actual filter compared with second-order estimates obtained by maximum likelihood (ML), ridge regression (MAP), and automatic relevance determination (ARD), with the normalized power of each.]

Kernel methods
An embedding $\phi : \mathcal{X} \to \Phi$ maps each stimulus $x_i$ to a feature vector $\phi(x_i)$. For certain embeddings $\phi$ we can find a kernel function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that $K(x_1, x_2) = \phi(x_1) \cdot \phi(x_2)$. Thus linear operations in $\Phi$ can be performed without explicit embedding, and $\Phi$ can be very high-dimensional (even infinite-dimensional) without intractability. Only certain $K$ have an associated $\phi$, and vice versa (Mercer's theorem).

Kernel regression
Linear regression in the augmented space requires finding a weight vector $\gamma$ to solve $\Phi(X)\,\gamma = Y$. If $\gamma = \phi(x_\gamma)$ we could write $K(x_t, x_\gamma) = y_t$, but this is nonlinear in $x_\gamma$, and the restriction $\gamma \in \phi(\mathcal{X}) \subset \Phi$ is too stringent. Instead take $\gamma = \sum_i w_i\, \phi(x_i)$, giving $\sum_i w_i\, K(x_t, x_i) = y_t$, or $KW = Y$. If $\{x_i\} = \{x_t\}$ this representation is complete, because any additional component of $\gamma$ would lie in the null space of $\Phi$. In practice, we may choose $\{x_i\} \subset \{x_t\}$ randomly, or by a greedy process.
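
A minimal kernel-regression sketch along these lines (illustrative, not the authors' code): solve for the weights on a chosen basis set, with a small ridge term added for numerical stability, and predict through the kernel between test and basis points. `kernel` is any Mercer kernel function of two stimulus sets.

    import numpy as np

    def fit_kernel_regression(kernel, X_basis, X_train, y_train, lam=1e-3):
        K = kernel(X_train, X_basis)              # (train x basis) kernel matrix
        return np.linalg.solve(K.T @ K + lam * np.eye(K.shape[1]), K.T @ y_train)

    def predict_kernel_regression(kernel, w, X_basis, X_test):
        return kernel(X_test, X_basis) @ w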

Volterra space and the polynomial kernels
The Volterra augmented space has a corresponding kernel function. Define
$K_{p(1)}(x_1, x_2) = x_1^{\top} x_2$
$K_{p(2)}(x_1, x_2) = (x_1^{\top} x_2)^{2} = \Bigl(\sum_i x_1^i x_2^i\Bigr)^{2} = \sum_{ij} x_1^i x_2^i x_1^j x_2^j = \sum_{ij} (x_1^i x_1^j)(x_2^i x_2^j) = (x_1 x_1^{\top})(:) \cdot (x_2 x_2^{\top})(:)$
Then the kernel
$K_V(x_1, x_2) = K_{p(1)}(x_1, x_2) + K_{p(2)}(x_1, x_2) + \cdots$
corresponds to the Volterra augmentation
$\phi_V(x) = \bigl[\, x(:)\,;\ (x x^{\top})(:)\,;\ \dots \,\bigr]$
(Also useful is the kernel $K_{p(n)}(x_1, x_2) = (1 + x_1^{\top} x_2)^{n}$.)
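
A quick numerical check of this identity for the degree-2 polynomial kernel (an illustrative snippet, not from the talk):

    import numpy as np

    rng = np.random.default_rng(0)
    x1, x2 = rng.normal(size=5), rng.normal(size=5)

    def phi(x):
        return np.outer(x, x).ravel()          # explicit quadratic features (x x^T)(:)

    # (x1 . x2)^2 equals the dot product in the explicit feature space
    assert np.isclose((x1 @ x2) ** 2, phi(x1) @ phi(x2))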

Other kernels
The Volterra expansion corresponds to the polynomial kernels. Other kernels can provide other non-linear expansions of the transform, some with useful intuitive interpretations. For example, the Gaussian kernel
$K_{\sigma}(x_1, x_2) = e^{-\|x_1 - x_2\|^{2} / 2\sigma^{2}}$ gives $y_t = \sum_i w_i\, e^{-\|x_t - x_i\|^{2} / 2\sigma^{2}}$
ARD will identify a small number of characteristic vectors $x_i$, providing a tiled approximation to the nonlinearity.
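
A minimal Gaussian (RBF) kernel that can be passed as `kernel` to the kernel-regression sketch above; the width `sigma` is an assumed free parameter:

    import numpy as np

    def gaussian_kernel(A, B, sigma=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))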

Simulated complex cell
[Figure: kernel-regression results for the simulated complex cell; panels labelled by numeric values only.]

Input non-linearities
[Diagram: stimulus inputs pass through static non-linearities before a weighted summation (Σ).]
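
One simple way to parameterize such an input non-linearity, consistent with the hinge-weight examples on the following slides, is as a weighted sum of hinge (one-sided linear) basis functions with fixed thresholds; the hinge weights then enter the model linearly and can be estimated by the same ARD regression. A minimal illustrative sketch (the thresholds and names are assumptions, not taken from the talk):

    import numpy as np

    def hinge_features(s, thresholds):
        """s: (samples,) scalar inputs; returns (samples x len(thresholds)).

        g(s) = sum_k w_k * max(0, s - c_k), so the weights w_k are linear
        coefficients on these features and can be fit with ARD as above.
        """
        return np.maximum(0.0, s[:, None] - np.asarray(thresholds)[None, :])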

Hinge weights example 1
[Figure: estimated hinge weights (input non-linearities) and spectrotemporal filters (frequency in kHz versus time in ms) for an example cell, comparing the linear (ML), linear (ARD), and hinge (ARD) models. Recording R211182G/21731/pen14loc2poisshical2.]

Hinge weights example 2
[Figure: as above, for a second example cell, comparing the linear (ML), linear (ARD), and hinge (ARD) models. Recording J21492G/21717/pen8loc3poisshical8.]

Conclusions
The linear-regression viewpoint on both linear and nonlinear finite-impulse system identification permits the straightforward application of advanced regression techniques. In particular, ARD allows us to restrict the range of the filter to the relevant input subspace. The standard Volterra expansion for non-linear functions is a special case of the general kernel-regression method. Other kernels can provide equally, if not more, interpretable representations of the nonlinearity, particularly when combined with ARD.