Hierarchy. Will Penny. 24th March 2011. Linear Models. Convergence. Nonlinear Models. References

1 24th March 2011

2 Hierarchical Model
Rao and Ballard (1999) presented a hierarchical model of visual cortex to show how classical and extra-classical Receptive Field (RF) effects could be explained by Bayesian inference in a cortical hierarchy.
$$y = W_1 x_1 + e_1$$
$$x_1 = W_2 x_2 + e_2$$

3 [Figure: the corresponding graphical model]
The joint likelihood of data and model parameters is
$$p(y, W_1, W_2, x_1, x_2) = p(y \mid W_1, x_1)\, p(W_1)\, p(x_1 \mid x_2, W_2)\, p(W_2)\, p(x_2)$$
The first and second level prediction errors are assumed isotropic Gaussian with precisions $\lambda_1$ and $\lambda_2$. The priors over $W_1$ and $W_2$ are zero mean Gaussian, having isotropic covariances with precisions $\alpha_1$ and $\alpha_2$.
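The factorisation of the joint can be read as a recipe for ancestral sampling: draw the weights from their priors, then the top-level causes, then each level below. The following is a minimal sketch under stated assumptions: the layer sizes, the precision values and the standard normal choice for $p(x_2)$ are all illustrative and are not given on the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_y, n_x1, n_x2 = 16, 8, 4            # hypothetical layer sizes
lam1, lam2 = 10.0, 5.0                # error precisions (assumed values)
alpha1, alpha2 = 1.0, 1.0             # weight-prior precisions (assumed values)

def gaussian(precision, size):
    """Zero-mean isotropic Gaussian draw with the given precision."""
    return rng.normal(scale=precision ** -0.5, size=size)

W1 = gaussian(alpha1, (n_y, n_x1))    # p(W1)
W2 = gaussian(alpha2, (n_x1, n_x2))   # p(W2)
x2 = rng.normal(size=n_x2)            # p(x2): standard normal, an assumption
x1 = W2 @ x2 + gaussian(lam2, n_x1)   # p(x1 | x2, W2)
y = W1 @ x1 + gaussian(lam1, n_y)     # p(y | W1, x1)
```

Each line of the sampler corresponds to one factor of the joint likelihood, taken from the top of the hierarchy downwards.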

4 All parameters $W_1$, $W_2$, $x_1$ and $x_2$ are learnt by gradient ascent on the relevant part of the joint likelihood. For example, for $x_1$ we have
$$L(x_1) = \log\left[ p(y \mid W_1, x_1)\, p(x_1 \mid x_2, W_2) \right]$$
as all other terms do not depend on $x_1$. Maximising this function will implicitly maximise the posterior probability $p(x_1 \mid y)$.

5 The joint log-likelihood as a function of first layer activity is
$$L(x_1) = -\frac{\lambda_1}{2} (y - \hat{y})^T (y - \hat{y}) - \frac{\lambda_2}{2} (x_1 - \hat{x}_1)^T (x_1 - \hat{x}_1)$$
with image predictions $\hat{y} = W_1 x_1$ and predictions of first layer activity $\hat{x}_1 = W_2 x_2$.
$L(x_1)$ comprises precision weighted prediction error terms from both levels.

6 The activity is then updated by gradient ascent, which gives
$$\tau_x \frac{dx_1}{dt} = \frac{dL(x_1)}{dx_1} = \lambda_1 W_1^T (y - W_1 x_1) + \lambda_2 (\hat{x}_1 - x_1)$$
This is the same as online Bayesian learning for linear systems (last lecture). These updates have a simple predictive coding implementation.
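Because $L(x_1)$ is quadratic, the gradient ascent above can be checked against a closed-form maximum. The sketch below integrates the update with Euler steps and compares the result with the solution of the corresponding linear system. The sizes, precision values, random data and step-size rule are assumptions chosen for illustration, and the top-down prediction $\hat{x}_1$ is held fixed.

```python
import numpy as np

rng = np.random.default_rng(1)
n_y, n_x1 = 12, 5                      # hypothetical sizes
lam1, lam2 = 4.0, 2.0                  # precisions (assumed values)
W1 = rng.normal(size=(n_y, n_x1))
y = rng.normal(size=n_y)
x1_hat = rng.normal(size=n_x1)         # top-down prediction W2 x2, held fixed

def grad_L(x1):
    """dL/dx1: precision-weighted errors from both levels."""
    return lam1 * W1.T @ (y - W1 @ x1) + lam2 * (x1_hat - x1)

# Closed-form maximum of the quadratic L(x1), used only as a check.
A = lam1 * W1.T @ W1 + lam2 * np.eye(n_x1)
x1_map = np.linalg.solve(A, lam1 * W1.T @ y + lam2 * x1_hat)

tau_x = 1.0
dt = tau_x / np.linalg.eigvalsh(A).max()   # step small enough for stable Euler
x1 = np.zeros(n_x1)
for _ in range(5000):
    x1 = x1 + (dt / tau_x) * grad_L(x1)
```

After the loop, `x1` should agree with `x1_map` to numerical precision: the estimate is a precision-weighted compromise between what the image supports and what the level above predicts.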

7 Architecture
Mumford (1992): "I put forward a hypothesis on the role of the reciprocal, topographic pathways between two cortical areas, one often a higher area dealing with more abstract information about the world, the other lower dealing with more concrete data. The higher area attempts to fit its abstractions to the data it receives from lower areas by sending back to them from its deep pyramidal cells a template reconstruction best fitting the lower level view. The lower area attempts to reconcile the reconstruction of its view that it receives from higher areas with what it knows, sending back from its superficial pyramidal cells the features in its data which are not predicted by the higher area."

8 The prediction errors are
$$e_1 = y - W_1 x_1$$
$$e_2 = x_1 - W_2 x_2$$

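9 With $e_1 = y - W_1 x_1$ and $e_2 = x_1 - W_2 x_2$, the activity updates are
$$\tau_x \dot{x}_1 = \lambda_1 W_1^T e_1 - \lambda_2 e_2$$
$$\tau_x \dot{x}_2 = \lambda_2 W_2^T e_2$$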

10 The weight updates are
$$\tau_w \dot{W}_1 = \lambda_1 e_1 x_1^T - \alpha_1 W_1$$
$$\tau_w \dot{W}_2 = \lambda_2 e_2 x_2^T - \alpha_2 W_2$$
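The activity and weight updates are all gradients of the same log joint, so running them together should increase it. A minimal sketch, under these assumptions: a flat prior on $x_2$, a single shared Euler step in place of separate time constants $\tau_x$ and $\tau_w$, and illustrative sizes and precision values. With $e_2 = x_1 - W_2 x_2$, the second-level term in the $x_1$ update enters with a minus sign, matching $\lambda_2(\hat{x}_1 - x_1)$.

```python
import numpy as np

rng = np.random.default_rng(2)
lam1, lam2 = 1.0, 1.0        # error precisions (assumed values)
alpha1, alpha2 = 0.1, 0.1    # weight-prior precisions (assumed values)

def log_joint(y, x1, x2, W1, W2):
    """Log joint up to constants, with a flat prior on x2 (an assumption)."""
    e1 = y - W1 @ x1
    e2 = x1 - W2 @ x2
    return (-0.5 * lam1 * e1 @ e1 - 0.5 * lam2 * e2 @ e2
            - 0.5 * alpha1 * np.sum(W1 ** 2) - 0.5 * alpha2 * np.sum(W2 ** 2))

def gradients(y, x1, x2, W1, W2):
    """Right-hand sides of the activity and weight updates."""
    e1 = y - W1 @ x1                       # first-level prediction error
    e2 = x1 - W2 @ x2                      # second-level prediction error
    dx1 = lam1 * W1.T @ e1 - lam2 * e2
    dx2 = lam2 * W2.T @ e2
    dW1 = lam1 * np.outer(e1, x1) - alpha1 * W1
    dW2 = lam2 * np.outer(e2, x2) - alpha2 * W2
    return dx1, dx2, dW1, dW2

y = rng.normal(size=8)                     # one hypothetical 8-pixel image
x1, x2 = np.zeros(4), 0.1 * rng.normal(size=2)
W1 = 0.1 * rng.normal(size=(8, 4))
W2 = 0.1 * rng.normal(size=(4, 2))

L_start = log_joint(y, x1, x2, W1, W2)
dt = 0.01                                  # shared Euler step (assumption)
for _ in range(500):
    dx1, dx2, dW1, dW2 = gradients(y, x1, x2, W1, W2)
    x1, x2 = x1 + dt * dx1, x2 + dt * dx2
    W1, W2 = W1 + dt * dW1, W2 + dt * dW2
L_end = log_joint(y, x1, x2, W1, W2)
```

Every quantity in `gradients` is local to a connection: error units carry `e1` and `e2`, and each weight change is a product of a presynaptic activity and a postsynaptic error minus a decay.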

11 Increasing Receptive Field Size
In Rao and Ballard (1999) a Gaussian weighting was applied to the input of each first level node, so that it only sees a localised portion of the input image. Each second level unit effectively sees the whole image.

12 Increasing Receptive Field Size
The images $y$ (5 shown) are modelled with $j = 1..3$ first level modules with overlapping receptive fields.
$$y_j = W_{1j} x_{1j} + e_{1j}$$
$$(256 \times 1) = (256 \times 32)(32 \times 1) + (256 \times 1)$$
Each module is a linear expansion of a basis set $W$, with coefficients $x$ which are different for each image. Each module predicts activity in a pixel patch. One can think of the $i$th column of $W_{1j}$ as the projective field of the $i$th neuron in the $j$th module, and the $i$th entry in $x_{1j}$ as the activity or firing rate of the $i$th neuron in the $j$th module. These coefficients are then constrained by a second level model
$$x_1 = W_2 x_2 + e_2$$
$$(96 \times 1) = (96 \times 128)(128 \times 1) + (96 \times 1)$$
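The dimensions quoted above can be checked with a short shape-only sketch. The random values are placeholders, and the splitting of the 96-element first level vector into three consecutive blocks of 32, one per module, is an assumption about the ordering.

```python
import numpy as np

rng = np.random.default_rng(3)
n_modules, n_pix, n_l1, n_l2 = 3, 256, 32, 128

# One 256 x 32 basis set per first-level module.
W1 = [rng.normal(size=(n_pix, n_l1)) for _ in range(n_modules)]
# A single second-level matrix predicts all 3 * 32 = 96 first-level activities.
W2 = rng.normal(size=(n_modules * n_l1, n_l2))

x2 = rng.normal(size=n_l2)                   # second-level activity (128,)
x1_pred = W2 @ x2                            # top-down prediction (96,)
x1_blocks = np.split(x1_pred, n_modules)     # assumed ordering: module by module
y_pred = [W1[j] @ x1_blocks[j] for j in range(n_modules)]  # three (256,) patches

# Weights linking neuron 0 of module 0 to the 256 pixels of its patch.
weights_neuron0 = W1[0][:, 0]
```

The second level therefore spans all three patches at once, while each first level module only ever predicts its own 256 pixels.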

13 The prediction errors for module $j$ are
$$e_{1j} = y_j - W_{1j} x_{1j}$$
$$e_{2j} = x_{1j} - W_{2j} x_2$$

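14 With $e_{2j}$ defined as first level activity minus its top-down prediction, the activity update for module $j$ is
$$\tau_x \dot{x}_{1j} = \lambda_1 W_{1j}^T e_{1j} - \lambda_2 e_{2j}$$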

15 The weight update for module $j$ is
$$\tau_w \dot{W}_{1j} = \lambda_1 e_{1j} x_{1j}^T - \alpha_1 W_{1j}$$

16 Receptive Fields
Level 1 receptive fields are reminiscent of Difference-of-Gaussian (DOG) filters that have been used to model simple-cell RFs in primary visual cortex. Level 2 cells respond to more complex features.

17 The response of level-1 prediction error units diminishes for bars that extend beyond the end of each level-1 receptive field. This is known as end stopping.

18 Error unit activity arises naturally from the difference between bottom-up activity and top-down predictions. The network was trained on natural images, in which short bars seldom occur in isolation. Short bars are generally part of longer bars.

19 End stopping disappears if feedback in the model is disabled, or if layer 6 activity is inactivated in squirrel monkey (Sandell and Schiller, 1982).

20 Nonlinear models
Friston (2003) considers nonlinear hierarchical models of the form
$$x_1 = g(x_2, w_1) + e_1$$
$$x_2 = g(x_3, w_2) + e_2$$
$$\vdots$$
$$x_{R-1} = g(x_R, w_{R-1}) + e_{R-1}$$
where $y = x_1$ is the observed data, $g(x_{i+1}, w_i)$ is some nonlinear function of hidden causes $x_{i+1}$ and parameters $w_i$, $e_i$ is zero mean additive Gaussian noise with covariance $C_i$, and $i$ indexes the level in the hierarchy. $C_i$ is parameterised by $\lambda_i$. These equations embody structural priors. The generative model is not dynamic. The recognition model is.

21 There are no priors over $w$ or $\lambda$. The joint probability of activities in all regions $i = 1..R$ is therefore
$$p(x \mid w, \lambda) = \prod_{i=1}^{R} p(x_i \mid x_{i+1}, w_i, \lambda_i) = \prod_{i=1}^{R} N(x_i;\, g(x_{i+1}, w_i),\, C_i)$$

22 Joint Log Likelihood
The joint log likelihood is
$$L(x, w, \lambda) = \sum_{i=1}^{R} \log p(x_i \mid x_{i+1}, w_i, \lambda_i)$$
This can be written as (dropping constant terms)
$$L(x, w, \lambda) = \sum_{i=1}^{R} \left( -\frac{1}{2} e_i^T e_i - \frac{1}{2} \log |C_i| \right)$$
where the prediction errors are given by
$$e_i = C_i^{-1/2} \left[ x_i - g_i(x_{i+1}, w_i) \right]$$
The hidden causes, parameters and variance components can be estimated using a gradient ascent scheme to find their MAP values.

23 In the univariate case, if the error covariances have the form $C_i^{1/2} = 1 + \lambda_i$ then the prediction errors can be written
$$e_i = (1 + \lambda_i)^{-1} \left[ x_i - g(x_{i+1}, w_i) \right]$$
Rearranging gives
$$e_i = \left[ x_i - g(x_{i+1}, w_i) \right] - \lambda_i e_i$$
The last term acts as a decay with time constant $\lambda_i$.
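One way to see the decay interpretation is to let an error unit relax under dynamics whose fixed point satisfies the rearranged equation. The sketch below assumes, purely for illustration, the dynamics $\tau \dot{e}_i = [x_i - g] - \lambda_i e_i - e_i$. This specific differential equation, the time constant and the numerical values are assumptions, not taken from the slides.

```python
residual = 2.0      # x_i - g(x_{i+1}, w_i), a hypothetical fixed input
lam = 3.0           # strength of the decay (self-inhibition) term
tau, dt = 1.0, 0.01

e = 0.0
for _ in range(5000):
    # Driving input minus the lambda-weighted decay and a unit leak.
    e += (dt / tau) * (residual - lam * e - e)

e_closed_form = residual / (1.0 + lam)   # (1 + lambda)^-1 [x - g]
```

At convergence `e` matches `e_closed_form`, so the recurrent decay implements the division by $1 + \lambda_i$ without computing it explicitly.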

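24 The joint log-likelihood is
$$L(x) = \sum_i -\frac{1}{2} e_i^T e_i + \ldots$$
where
$$e_i = (1 + \lambda_i)^{-1} \left[ x_i - g(x_{i+1}, w_i) \right]$$
Hence
$$L(x_i) = -\frac{1}{2} e_{i-1}^T e_{i-1} - \frac{1}{2} e_i^T e_i$$
so
$$\frac{dL(x_i)}{dx_i} = -\left( \frac{de_{i-1}}{dx_i} \right)^T e_{i-1} - \left( \frac{de_i}{dx_i} \right)^T e_i$$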

25 For the hidden causes we have
$$\tau_x \dot{x}_i = \frac{dL}{dx_i} = -\left( \frac{de_{i-1}}{dx_i} \right)^T e_{i-1} - \left( \frac{de_i}{dx_i} \right)^T e_i$$
The first term is instantiated via forward recognition effects and the second term via lateral connections. These lateral connections embody the prior at each level. Connections are reciprocal.
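The two terms of this update can be made concrete in a scalar three-level example, where $x_1$ is observed, $x_3$ is held fixed and $x_2$ is inferred. This is a sketch under stated assumptions: the nonlinearity $g(x, w) = \tanh(wx)$, the parameter values and the Euler step are all chosen for illustration.

```python
import numpy as np

def g(x, w):
    return np.tanh(w * x)                  # assumed nonlinearity

def dg_dx(x, w):
    return w * (1.0 - np.tanh(w * x) ** 2)

w1, w2 = 1.5, 0.8                          # hypothetical parameters
lam1, lam2 = 0.5, 1.0                      # hypothetical variance components
x1, x3 = 0.6, 0.3                          # observed data, fixed top-level cause

def errors(x2):
    e1 = (x1 - g(x2, w1)) / (1.0 + lam1)   # error at the level below
    e2 = (x2 - g(x3, w2)) / (1.0 + lam2)   # error at x2's own level
    return e1, e2

def L(x2):
    e1, e2 = errors(x2)
    return -0.5 * e1 ** 2 - 0.5 * e2 ** 2

def dL_dx2(x2):
    e1, e2 = errors(x2)
    de1 = -dg_dx(x2, w1) / (1.0 + lam1)    # de_{i-1}/dx_i: forward term
    de2 = 1.0 / (1.0 + lam2)               # de_i/dx_i: lateral (prior) term
    return -de1 * e1 - de2 * e2

tau_x, dt = 1.0, 0.05
x2 = 0.0
for _ in range(2000):
    x2 += (dt / tau_x) * dL_dx2(x2)
```

The estimate settles between the value favoured by the top-down prediction $g(x_3, w_2)$ and the value that would reproduce the data exactly, which is the compromise the two error terms encode.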

26 Top-down synapses
For the top-down connections we have
$$\tau_w \dot{w}_i = \frac{dL}{dw_i} = -\left( \frac{de_i}{dw_i} \right)^T e_i$$
This reduces to Hebbian learning for linear models
$$\tau_w \dot{w}_i = (1 + \lambda_i)^{-1} e_i x_{i+1}^T$$
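The linear case can be worked through in one step. Assuming $g(x_{i+1}, w_i) = w_i x_{i+1}$ with the univariate error form $e_i = (1 + \lambda_i)^{-1}[x_i - g(x_{i+1}, w_i)]$ and scalar quantities (so the transpose can be dropped), the derivative of the error with respect to the weight is a scaled copy of the presynaptic activity:

```latex
\begin{align}
e_i &= (1 + \lambda_i)^{-1} \left[ x_i - w_i x_{i+1} \right] \\
\frac{de_i}{dw_i} &= -(1 + \lambda_i)^{-1} x_{i+1} \\
\tau_w \dot{w}_i &= -\frac{de_i}{dw_i}\, e_i = (1 + \lambda_i)^{-1} e_i\, x_{i+1}
\end{align}
```

The weight change is the product of a postsynaptic error $e_i$ and a presynaptic activity $x_{i+1}$, scaled by $(1 + \lambda_i)^{-1}$, which is what makes the rule Hebbian.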

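27 Recurrent synapses on error units
For the variance components
$$\tau_\lambda \dot{\lambda}_i = \frac{dL}{d\lambda_i} = -\left\langle \left( \frac{de_i}{d\lambda_i} \right)^T e_i \right\rangle - \frac{1}{1 + \lambda_i}$$
The self-connections whiten the errors
$$\tau_\lambda \dot{\lambda}_i = (1 + \lambda_i)^{-1} \left( e_i e_i^T - 1 \right)$$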
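The whitening effect can be checked numerically in the univariate case: the update stops changing $\lambda_i$ only when the scaled errors have unit average power. A minimal sketch, assuming the average is taken over a fixed batch of residuals drawn with standard deviation 3, and with illustrative values for the time constant and step size.

```python
import numpy as np

rng = np.random.default_rng(4)
# Unscaled residuals x_i - g(x_{i+1}, w_i); the spread of 3 is an assumption.
residuals = rng.normal(scale=3.0, size=5000)

lam = 0.0
tau_lam, dt = 1.0, 0.05
for _ in range(4000):
    e = residuals / (1.0 + lam)                  # scaled prediction errors
    # Batch-averaged update: (1 + lambda)^-1 (<e^2> - 1)
    lam += (dt / tau_lam) * (np.mean(e ** 2) - 1.0) / (1.0 + lam)

e = residuals / (1.0 + lam)
rms_residual = np.sqrt(np.mean(residuals ** 2))  # close to 3 for this sample
```

At the fixed point $1 + \lambda_i$ equals the root mean square of the residuals, so the scaled errors `e` have mean square equal to one.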

28 References
K. Friston (2003) Learning and inference in the brain. Neural Networks 16.
M. Mesulam (1998) From sensation to cognition. Brain 121.
D. Mumford (1992) On the computational architecture of the neocortex II: The role of cortico-cortical loops. Biological Cybernetics 66.
R. Rao and D. Ballard (1999) Nature Neuroscience 2.
G. Shepherd (2004) The Synaptic Organisation of the Brain. Oxford.
