Robust regression and non-linear kernel methods for characterization of neuronal response functions from limited data

Maneesh Sahani, Gatsby Computational Neuroscience Unit, University College London
Jennifer Linden, Keck Center for Integrative Neurosciences, University of California at San Francisco
Studying perceptual systems

A stimulus $x(t)$ drives the system, which produces a response $y(t)$.

Decoding: $\hat{x}(t) = G[y(t)]$ (reconstruction)
Encoding: $\hat{y}(t) = F[x(t)]$ (systems identification)
Spike-triggered average

Decoding: mean of $P(x \mid y = 1)$
Encoding: predictive filter
Wiener filtering is linear regression

$$y(t) = \int_0^T x(t-\tau)\,w(\tau)\,d\tau$$

In discrete time, stack lagged copies of the stimulus into the rows of a design matrix:

$$\begin{pmatrix} x_1 & x_2 & x_3 & \cdots & x_T \\ x_2 & x_3 & x_4 & \cdots & x_{T+1} \\ \vdots & & & & \vdots \end{pmatrix} \begin{pmatrix} w_T \\ \vdots \\ w_2 \\ w_1 \end{pmatrix} = \begin{pmatrix} y_T \\ y_{T+1} \\ \vdots \end{pmatrix} \qquad\text{i.e.}\qquad XW = Y$$

Frequency domain: $\displaystyle W(\omega) = \frac{X(\omega)^{*}\,Y(\omega)}{|X(\omega)|^2}$

Time domain: $\displaystyle W = \underbrace{(X^\top X)^{-1}}_{\Sigma_{SS}^{-1}} \underbrace{(X^\top Y)}_{\text{STA}}$
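As a concrete sketch of the time-domain solution (the function and variable names are illustrative, not from the poster):

```python
import numpy as np

def wiener_filter(x, y, n_lags):
    """Least-squares filter estimate: W = (X'X)^{-1} (X'Y).
    x: stimulus time series; y: response time series; n_lags: filter length."""
    # Each row of X holds the n_lags most recent stimulus values,
    # oldest first, so the returned weights run from w_T down to w_1.
    X = np.stack([x[t - n_lags:t] for t in range(n_lags, len(x))])
    Y = y[n_lags:]
    # (X'X) is the stimulus autocovariance Sigma_SS; (X'Y) is the STA (up to scaling).
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W
```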
Overfitting

Maximum-likelihood estimates often overfit to noise in the training data. Overfitting is a fundamental problem in data modelling: it can never be completely overcome, and even the correct model with the correct priors will overfit.

One common signature of overfitting in (quasi-)linear models is the appearance of large weights of opposite signs: the difference between them has been tuned by likelihood optimization to model the training data. Such solutions are also often unstable: change the data a little and the weights fluctuate alarmingly.

Overfitting is often combated by penalizing large weights. This is called weight decay in the neural-network literature and ridge regression in the statistics literature, and it is equivalent to a prior distribution centered on zero weight in the Bayesian literature.

[Figure: a 1-D fit to the data recovers the true relationship, while an N-D fit with spurious (random) inputs overfits.]
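A minimal sketch of the ridge-regression remedy, assuming a single penalty strength `lam` shared across all weights:

```python
import numpy as np

def ridge_fit(X, Y, lam=1.0):
    """Penalized least squares: W = (X'X + lam*I)^{-1} X'Y.
    Equivalent to a zero-mean Gaussian prior on the weights."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)
```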
ARD

Maximum-likelihood regression can be improved by a Bayesian analysis called Automatic Relevance Determination (ARD), due to MacKay and Neal.

Assume $w_i \sim \mathcal{N}(0, 1/\alpha_i)$ independently, and define the evidence

$$\mathcal{L}_2(\{\alpha_i\}) = \int dw_1 \cdots dw_D\; \underbrace{P(Y \mid X; W)}_{\mathcal{L}(W)}\; P(W \mid \{\alpha_i\}).$$

Optimize $\mathcal{L}_2$ with respect to $\{\alpha_i\}$ (ML-II). Then:

$\alpha_i \to \infty \;\Rightarrow\; w_i = 0$: irrelevant
$\alpha_i$ finite $\;\Rightarrow\; w_i = \operatorname{argmax} P(w_i \mid X, Y, \alpha_i)$: relevant

[Figure: an N-D fit with ARD recovers the true relationship from the data.]
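One off-the-shelf route to ARD (not the authors' own implementation) is scikit-learn's `ARDRegression`, which places an independent zero-mean Gaussian prior with precision $\alpha_i$ on each weight and optimizes the evidence. A toy sketch:

```python
import numpy as np
from sklearn.linear_model import ARDRegression

# Toy problem: only the first 3 of 20 inputs are relevant.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
w_true = np.zeros(20)
w_true[:3] = [1.0, -2.0, 0.5]
y = X @ w_true + 0.1 * rng.standard_normal(200)

ard = ARDRegression().fit(X, y)
# Irrelevant inputs get alpha_i -> infinity, driving their weights to ~0.
print(np.round(ard.coef_, 2))
```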
Simulated simple cell

[Figure: actual filter alongside estimates by maximum likelihood (ML), ridge regression (MAP), and automatic relevance determination (ARD); bar plots of normalized predictive power and excess predicted power for ML, MAP and ARD.]
Rat auditory cortex cell

[Figure: spectro-temporal receptive field estimates (frequency in kHz against time in ms) by maximum likelihood and by ARD, with average power profiles; bar plots of normalized predictive power and excess predicted power for ML and ARD. Recording R216182G/21911/pen12loc2poisshical8a.]
Population predictive performance

Tested on new responses to the same stimuli.

[Figure: normalized ARD predictive power against normalized ML predictive power; normalized predictive power difference (ARD − ML) against normalized noise power, with a histogram of the number of recordings.]

Parameters can still overfit to the particular stimulus used.
Population predictive performance

Tested on responses to new stimuli.

[Figure: normalized ARD predictive power against normalized ML predictive power; normalized predictive power difference (ARD − ML) against normalized noise power, with a histogram of the number of recordings.]
Evaluating constrained linear models by prediction
Separating signal power from noise

$$y(t) = \underbrace{\mu(t)}_{\text{signal}} + \underbrace{\eta(t)}_{\text{noise}}$$

Taking powers, $P(y(t)) = \left\langle \left(y(t) - \langle y \rangle\right)^2 \right\rangle$:

$$\underbrace{P(y)}_{\text{observed power}} = \underbrace{P(\mu)}_{\text{signal power}} + \underbrace{\sigma^2}_{\text{average noise variance}}$$

The maximum possible reduction in variance is given by the power of the stimulus-locked signal, and repeated trials that use the same stimulus allow us to separate signal from noise. Averaging $N$ trials gives

$$P(\bar y) = P(\mu) + \frac{\sigma^2}{N} \qquad\Rightarrow\qquad \hat P(\mu) = \frac{N\,P(\bar y) - P(y)}{N - 1},$$

which is an unbiased estimator, with a computable variance, provided the noise distribution is not pathological:

$$\operatorname{Var}\!\left[\hat P(\mu)\right] = \frac{4}{N}\left(\frac{1}{T^2}\,\mu^{\top}\Sigma\,\mu - \frac{2}{T}\,\bar\mu\,\sigma^{\top}\mu + \bar\mu^2\bar\sigma\right) + \frac{2}{N(N-1)}\left(\frac{1}{T^2}\operatorname{Tr}[\Sigma\Sigma] - \frac{2}{T}\,\sigma^{\top}\sigma + \bar\sigma^2\right)$$

where $\Sigma$, $\sigma$ and $\bar\sigma$ are statistics describing the (co)variance of the noise and must themselves be estimated from data.

This is an estimated bound, but it will not be practicably achievable, because any model (even the right one!) will overfit given limited data.
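A minimal sketch of the estimator $\hat P(\mu)$, assuming responses from $N$ repeated trials of the same stimulus are stacked in an $N \times T$ array:

```python
import numpy as np

def signal_power(R):
    """Unbiased estimate of stimulus-locked signal power P(mu)
    from repeated trials of the same stimulus (rows of R)."""
    N = R.shape[0]
    P_y = R.var(axis=1).mean()       # average single-trial power, P(y)
    P_ybar = R.mean(axis=0).var()    # power of the N-trial average, P(y-bar)
    return (N * P_ybar - P_y) / (N - 1)
```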
Non-parametric predictability estimates

Non-parametric prediction can provide an estimated lower bound on predictable power. The cell's response function is not modelled explicitly; instead, responses are predicted from observations made on similar stimuli.

The simplest approach is to average the training data over trials. However, this will overfit to the noise severely, and it does not allow prediction on new stimuli. An alternative is to smooth the average with reference to the stimulus; this is sometimes known as locally-weighted regression or Parzen-window regression.

[Figure: predictive power of the linear ML, linear ARD, trial-average and Parzen-window predictors for two recordings, R211182G/21731/pen14loc2poisshical2 and R217251G/21922/pen7loc2poisshical3.]
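A minimal sketch of the Parzen-window predictor; the Gaussian window and bandwidth `sigma` are illustrative choices:

```python
import numpy as np

def parzen_predict(X_train, Y_train, X_test, sigma=1.0):
    """Predict responses to new stimuli as a similarity-weighted
    average of observed responses (Nadaraya-Watson regression)."""
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    return (W @ Y_train) / W.sum(axis=1)
```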
Nonlinearities: the Volterra series

The usual extension to non-linear functionals involves a power series in $x(t)$:

$$\hat y(t) = G[x] = g_0 + \int d\tau_1\, g_1(\tau_1)\,x(t-\tau_1) + \iint d\tau_1\,d\tau_2\, g_2(\tau_1,\tau_2)\,x(t-\tau_1)\,x(t-\tau_2) + \cdots$$

In discrete time,

$$G(\mathbf{x}_t) = g_0 + \mathbf{x}_t^{\top} g_1 + \mathbf{x}_t^{\top} g_2\, \mathbf{x}_t + \cdots = \underbrace{\begin{pmatrix}1 & \mathbf{x}_t^{\top} & (\mathbf{x}_t \mathbf{x}_t^{\top})(:)^{\top}\end{pmatrix}}_{\phi_V(\mathbf{x}_t)^{\top}} \underbrace{\begin{pmatrix}g_0 \\ g_1 \\ g_2(:)\end{pmatrix}}_{w}$$

so that $G(\mathbf{x}_t) = \phi_V(\mathbf{x}_t)^{\top} w = y_t$, or $\Phi_V(X)\,W = Y$.
Thus, higher-order terms can be estimated by linear regression in an augmented space.
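A minimal sketch of the second-order augmentation $\phi_V$ (the function name is illustrative):

```python
import numpy as np

def volterra_features(X):
    """Map each stimulus row x to [1, x, vec(x x^T)], so that a
    second-order Volterra model becomes ordinary linear regression."""
    T = X.shape[0]
    ones = np.ones((T, 1))
    quad = np.einsum('ti,tj->tij', X, X).reshape(T, -1)  # (x x^T)(:) per row
    return np.hstack([ones, X, quad])
```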
Simulated complex cell

[Figure: actual filter alongside estimates by maximum likelihood (ML), ridge regression (MAP), and automatic relevance determination (ARD); bar plot of normalized predictive power for ML, MAP and ARD.]
Kernel methods

An embedding $\phi : \mathcal{X} \to \Phi$ maps each input $x_i \in \mathcal{X}$ to a feature vector $\phi(x_i) \in \Phi$. For certain embeddings $\phi$ we can find a kernel function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that

$$K(x_1, x_2) = \phi(x_1) \cdot \phi(x_2).$$

Thus linear operations in $\Phi$ can be performed without explicit embedding, and $\Phi$ can be very high-dimensional (even infinite-dimensional) without intractability. Only certain $K$ have an associated $\phi$, and vice versa (Mercer).
Kernel regression

Linear regression in the augmented space requires finding a weight vector $\gamma$ to solve $\Phi(X)\,\gamma = Y$. If $\gamma = \phi(x_\gamma)$ we could write $K(x_t, x_\gamma) = y_t$, but this is nonlinear in $x_\gamma$, and the restriction $\gamma \in \phi(\mathcal{X}) \subset \Phi$ is too stringent. Instead take

$$\gamma = \sum_i w_i\,\phi(x_i) \qquad\text{giving}\qquad \sum_i w_i\,K(x_t, x_i) = y_t \quad\text{or}\quad KW = Y.$$

If $\{x_i\} = \{x_t\}$ this representation is complete, because any additional part of $\gamma$ would lie in the null space of $\Phi$. In practice, we may choose $\{x_i\} \subset \{x_t\}$ randomly, or by a greedy process.
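A minimal sketch of the dual solve; the small ridge term `lam` is an added numerical safeguard, not part of the derivation above:

```python
import numpy as np

def kernel_regression_fit(K, Y, lam=1e-6):
    """Dual weights w_i for y_t = sum_i w_i K(x_t, x_i).
    K[t, i] = K(x_t, x_i) over the chosen basis points {x_i},
    which may be a subset of the training inputs {x_t}."""
    n = K.shape[1]
    return np.linalg.solve(K.T @ K + lam * np.eye(n), K.T @ Y)

def kernel_regression_predict(K_test, W):
    """Predictions for new inputs, with K_test[s, i] = K(x_s, x_i)."""
    return K_test @ W
```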
Volterra space and the polynomial kernels

The Volterra augmented space has a corresponding kernel function. Define

$$K_{p(1)}(x_1, x_2) = x_1 \cdot x_2$$

$$K_{p(2)}(x_1, x_2) = (x_1 \cdot x_2)^2 = \Big(\sum_i x_1^i x_2^i\Big)^2 = \sum_{ij} x_1^i x_2^i x_1^j x_2^j = \sum_{ij} (x_1^i x_1^j)(x_2^i x_2^j) = (x_1 x_1^{\top})(:) \cdot (x_2 x_2^{\top})(:)$$

Then the kernel

$$K_V(x_1, x_2) = K_{p(1)}(x_1, x_2) + K_{p(2)}(x_1, x_2) + \cdots$$

corresponds to the Volterra augmentation

$$\phi_V(x) = \begin{pmatrix} x(:) \\ (x x^{\top})(:) \\ \vdots \end{pmatrix}$$

(Also useful is the kernel $K_{p(n)}(x_1, x_2) = (1 + x_1 \cdot x_2)^n$.)
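A quick numerical check (illustrative, not from the poster) that $K_{p(2)}$ matches the inner product of the vectorized outer products:

```python
import numpy as np

rng = np.random.default_rng(1)
x1, x2 = rng.standard_normal(6), rng.standard_normal(6)
phi2 = lambda x: np.outer(x, x).ravel()  # (x x^T)(:)
assert np.isclose((x1 @ x2) ** 2, phi2(x1) @ phi2(x2))
```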
Other kernels

The Volterra expansion corresponds to the polynomial kernels, but other kernels can provide other non-linear expansions of the transform, some with useful intuitive interpretations. For example, the Gaussian kernel

$$K_\sigma(x_1, x_2) = e^{-\|x_1 - x_2\|^2 / 2\sigma^2} \qquad\Rightarrow\qquad y_t = \sum_i w_i\, e^{-\|x_t - x_i\|^2 / 2\sigma^2}.$$

ARD will identify a small number of characteristic vectors $x_i$, providing a tiled approximation to the nonlinearity.
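A minimal sketch combining Gaussian-kernel columns with ARD over the dual weights; `centers` (the candidate $x_i$) and `sigma` are assumed choices:

```python
import numpy as np
from sklearn.linear_model import ARDRegression

def rbf_ard_fit(X, y, centers, sigma=1.0):
    """Regress y on Gaussian-kernel columns K(x, x_i); ARD prunes most
    centers, leaving a few characteristic vectors that tile the nonlinearity."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-d2 / (2 * sigma ** 2))
    return ARDRegression().fit(K, y)
```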
Simulated complex cell

[Figure: panels labelled by numerical weight values.]
Input non-linearities

[Diagram: inputs pass through static non-linearities before being combined by a linear summation Σ.]
Hinge weights: example 1

[Figure: spectro-temporal weights (frequency in kHz against time in ms) for the linear (ML), linear (ARD) and hinge (ARD) models, with their predictive power. Recording R211182G/21731/pen14loc2poisshical2.]
Hinge weights: example 2

[Figure: spectro-temporal weights (frequency in kHz against time in ms) for the linear (ML), linear (ARD) and hinge (ARD) models, with their predictive power. Recording J21492G/21717/pen8loc3poisshical8.]
Conclusions

The linear regression viewpoint for both linear and nonlinear finite-impulse systems identification permits the straightforward application of advanced regression techniques. In particular, ARD allows us to restrict the range of the filter to the relevant input subspace.

The standard Volterra expansion for non-linear functions is a special case of the general kernel regression method. Other kernels can provide equally, if not more, interpretable representations of the nonlinearity, particularly when combined with ARD.