Learning and Meta-learning
1 Learning and Meta-learning
computation: making predictions; choosing actions; acquiring episodes; statistics
algorithm: gradient ascent (e.g. of the likelihood); correlation; Kalman filtering
implementation: Hebbian synaptic plasticity; neuromodulation
2 Adaptation and Learning
Two strategies:
bottom-up: rules of neural plasticity (Hebbian; sliding threshold); mathematical application (development); computational significance (PCA; projection pursuit; probabilistic model fitting)
top-down: objective function(s) E(w); adjust w by following the gradient of E(w); delta rule; basis functions
3 Types of Learning
supervised: inputs u and desired or target outputs v both provided
reinforcement (max r): input u and scalar evaluation r, often with a temporal credit assignment problem
unsupervised or self-supervised: learn structure from the statistics of u alone
These are closely related: supervised learns P[v|u]; unsupervised learns P[v, u]
4 Hebb famously suggested: if cell A consistently contributes to the activity of cell B, then the synapse from A to B should be strengthened
strong element of causality
what about weakening (LTD)?
multiple timescales: STP to protein synthesis
multiple biochemical mechanisms
systems: hippocampus (multiple sub-areas); neocortex (layer and area differences); cerebellum (LTD is the norm)
5 Neural Rules
[Figure: field potential amplitude (mV) versus time (min); high-frequency stimulation gives LTP, low-frequency stimulation gives LTD; potentiated, depressed/partially depotentiated, and control levels are marked]
6 Stability and Competition
Hebbian learning involves positive feedback. Control by:
LTD: usually not enough; covariance versus correlation
saturation: prevent synaptic weights from getting too big (or too small); triviality beckons
competition: spike-time-dependent learning rules; normalization over pre-synaptic or post-synaptic arbors
  subtractive: decrease all synapses by the same amount, whether large or small
  multiplicative: decrease large synapses by more than small synapses
7 Preamble
Linear firing rate model: τ_r dv/dt = −v + w·u = −v + Σ_{b=1}^{N_u} w_b u_b
Assume that τ_r is small compared with the rate of change of the weights; then during plasticity v = w·u
Then have τ_w dw/dt = f(v, u, w)
Supervised rules use targets to specify v (neural basis in ACh?)
8 The Basic Hebb Rule
τ_w dw/dt = v u; averaged over the input statistics this gives
τ_w dw/dt = ⟨v u⟩ = ⟨u u⟩·w = Q·w
where Q is the input correlation matrix.
Positive feedback instability: τ_w d|w|²/dt = 2 τ_w w·(dw/dt) = 2 v² ≥ 0
Also have the discretised version w → w + (T/τ_w) Q·w, integrating over time, presenting patterns for T seconds.
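A minimal numerical sketch (not part of the slides) of the discretised rule w → w + (T/τ_w) Q·w; the input statistics, τ_w and T are illustrative assumptions. It shows |w| growing without bound while the direction of w aligns with the principal eigenvector of Q.

```python
import numpy as np

# A made-up set of correlated inputs; Q is their correlation matrix <u u^T>.
rng = np.random.default_rng(0)
mix = np.array([[1.0, 0.5, 0.2],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 1.0]])
u_samples = rng.normal(size=(1000, 3)) @ mix
Q = u_samples.T @ u_samples / len(u_samples)

# Discretised Hebb rule: w -> w + (T / tau_w) Q w (illustrative tau_w and T).
tau_w, T = 100.0, 1.0
w = rng.normal(scale=0.1, size=3)
for _ in range(2000):
    w = w + (T / tau_w) * (Q @ w)

# |w| grows without bound (positive feedback) while the direction of w
# aligns with the principal eigenvector e1 of Q.
e1 = np.linalg.eigh(Q)[1][:, -1]
print("norm of w:", np.linalg.norm(w))
print("cosine with e1:", abs(w @ e1) / np.linalg.norm(w))
```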
9 Covariance Rule
Since LTD really exists, contra Hebb:
τ_w dw/dt = u (v − θ_v)   or   τ_w dw/dt = (u − θ_u) v
If θ_v = ⟨v⟩ or θ_u = ⟨u⟩, then τ_w ⟨dw/dt⟩ = C·w
where C = ⟨(u − ⟨u⟩)(u − ⟨u⟩)⟩ is the input covariance matrix.
Still unstable: τ_w d|w|²/dt = 2 v (v − ⟨v⟩), which averages to twice the (positive) variance of v.
10 BCM Rule
Odd to have LTD with v = 0 or u = 0. Evidence for
τ_w dw/dt = v u (v − θ_v)
[Figure: weight change per unit u as a function of v, crossing zero at the threshold θ_v]
If θ_v slides to match a high power of v,
τ_θ dθ_v/dt = v² − θ_v
with a fast τ_θ, then we get competition between synapses: intrinsic stabilization.
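A rough simulation sketch of the BCM rule with its sliding threshold; the input statistics, saturation limits, learning rates and time constants below are illustrative assumptions rather than values from the slides.

```python
import numpy as np

# BCM rule with a sliding threshold; all constants are illustrative.
rng = np.random.default_rng(1)
n_in = 4
w = np.full(n_in, 0.5)
theta_v = 0.1
tau_w, tau_theta = 500.0, 50.0              # theta slides fast relative to w

for _ in range(20000):
    u = rng.exponential(scale=1.0, size=n_in)   # non-negative input rates
    v = max(w @ u, 0.0)                          # output rate
    # LTD for v < theta_v, LTP for v > theta_v
    w += (v * u * (v - theta_v)) / tau_w
    w = np.clip(w, 0.0, 5.0)                     # saturation limits
    # threshold tracks a high power of v (here v^2)
    theta_v += (v ** 2 - theta_v) / tau_theta

print("final weights:", np.round(w, 3), " threshold:", round(theta_v, 3))
```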
11 Subtractive Normalisation
Could normalise |w|² or Σ_b w_b = n·w, with n = (1, 1, ..., 1).
For subtractive normalisation of n·w:
τ_w dw/dt = v u − v (n·u) n / N_u
with dynamic subtraction, since
τ_w d(n·w)/dt = v (n·u) (1 − n·n/N_u) = 0, as n·n = N_u.
Strongly competitive: typically all the weights bar one go to 0. Therefore use an upper saturating limit.
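A small sketch of the averaged rule with subtractive normalisation and saturating limits; the input statistics and constants are made-up examples. It illustrates that n·w is conserved by the rule itself (until weights hit the limits) and that the outcome is strongly competitive.

```python
import numpy as np

# Averaged Hebb rule with subtractive normalisation of n.w plus saturating
# weight limits; input statistics and constants are made up.
rng = np.random.default_rng(2)
N_u = 5
n = np.ones(N_u)
u_samples = rng.normal(size=(2000, N_u)) + 0.5
Q = u_samples.T @ u_samples / len(u_samples)     # input correlation matrix

w = rng.uniform(0.4, 0.6, size=N_u)
dt_over_tau = 0.01
for _ in range(5000):
    dw = Q @ w - (w @ Q @ n) * n / N_u           # n.w conserved by this rule
    w = np.clip(w + dt_over_tau * dw, 0.0, 1.0)  # upper/lower saturating limits

print("sum of weights:", round(w.sum(), 3))
print("weights:", np.round(w, 3))  # most weights pinned at 0 or the upper limit
```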
12 The Oja Rule
A multiplicative way to ensure |w|² is constant:
τ_w dw/dt = v u − α v² w
so τ_w d|w|²/dt = 2 v² (1 − α |w|²), and |w|² → 1/α.
Dynamic normalisation could also enforce normalisation all the time.
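A sketch of the Oja rule run online on correlated Gaussian inputs (the input statistics and learning rate are illustrative assumptions); with α = 1 the squared norm settles near 1/α and w aligns with the principal eigenvector.

```python
import numpy as np

# Online Oja rule: tau_w dw/dt = v u - alpha v^2 w, with alpha = 1.
# Inputs are zero-mean correlated Gaussians (illustrative).
rng = np.random.default_rng(3)
A = np.array([[2.0, 0.8], [0.8, 1.0]])
alpha, lr = 1.0, 0.005

w = rng.normal(scale=0.1, size=2)
for _ in range(20000):
    u = A @ rng.normal(size=2)
    v = w @ u
    w += lr * (v * u - alpha * v ** 2 * w)

Q = A @ A.T                      # <u u^T> for u = A z, z ~ N(0, I)
e1 = np.linalg.eigh(Q)[1][:, -1]
print("|w|^2, expected ~ 1/alpha:", round(w @ w, 3))
print("alignment with e1:", round(abs(w @ e1) / np.linalg.norm(w), 4))
```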
13 Timing-Based Rules
[Figure: A: EPSP amplitude (% of control) versus time (min) for pairings at +10 ms and −10 ms; B: percent potentiation versus t_post − t_pre (ms)]
slice cortical pyramidal cells; Xenopus retinotectal system
window of ~50 ms gets Hebbian causality right
rate description: τ_w dw/dt = ∫_0^∞ dτ (H(τ) v(t) u(t−τ) + H(−τ) v(t−τ) u(t))
spike-based description necessary if an input spike can have a measurable impact on an output spike
critical factor is the overall integral: net LTD with local LTP
partially self-stabilizing
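A sketch of a pair-based spike-timing window with exponential LTP and LTD lobes; the amplitudes and time constants are illustrative assumptions, chosen so that the overall integral is negative (net LTD) with local LTP for small positive delays.

```python
import numpy as np

# Pair-based spike-timing window; A_plus, A_minus, tau_plus, tau_minus are
# illustrative values giving a net-negative integral with local LTP.
A_plus, tau_plus = 0.010, 20.0     # LTP for pre-before-post (dt > 0)
A_minus, tau_minus = 0.012, 20.0   # LTD for post-before-pre (dt < 0)

def stdp(dt_ms):
    """Weight change for a single pre/post spike pair, dt = t_post - t_pre."""
    if dt_ms > 0:
        return A_plus * np.exp(-dt_ms / tau_plus)
    return -A_minus * np.exp(dt_ms / tau_minus)

for dt in (-50, -10, 10, 50):
    print(f"dt = {dt:+4d} ms  ->  dw = {stdp(dt):+.4f}")

# overall integral of the window: negative here, so uncorrelated spiking depresses
integral = A_plus * tau_plus - A_minus * tau_minus
print("window integral (arbitrary units):", integral)
```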
14 Single Postsynaptic Neuron
Basic Hebb rule: τ_w dw/dt = Q·w
Analyse using an eigendecomposition of Q: Q·e_µ = λ_µ e_µ, with λ_1 ≥ λ_2 ≥ ...
Since Q is symmetric and positive (semi-)definite, it has a complete set of real orthonormal eigenvectors with non-negative eigenvalues, whose growth is decoupled.
Write w(t) = Σ_{µ=1}^{N_u} c_µ(t) e_µ; then c_µ(t) = c_µ(0) exp(λ_µ t / τ_w)
and w(t) → α(t) e_1 as t → ∞.
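A sketch that integrates the averaged rule for an arbitrary symmetric positive-definite Q (made up here) and checks numerically that w rotates onto the principal eigenvector e_1 while its norm grows; the matrix, time constant and step size are assumptions.

```python
import numpy as np

# Integrate tau_w dw/dt = Q w for a random symmetric positive-definite Q and
# check that w(t) ends up pointing along the principal eigenvector e1.
rng = np.random.default_rng(4)
B = rng.normal(size=(4, 4))
Q = B @ B.T / 4.0

lam, E = np.linalg.eigh(Q)        # eigenvalues ascending; e1 = E[:, -1]
tau_w, dt = 50.0, 0.1
w = rng.normal(size=4)
for _ in range(20000):
    w = w + (dt / tau_w) * (Q @ w)

cosine = abs(w @ E[:, -1]) / np.linalg.norm(w)
print("eigenvalues of Q:", np.round(lam, 3))
print("|w| after integration:", np.linalg.norm(w))
print("cosine between w and e1:", round(cosine, 6))   # -> 1 as t grows
```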
15 Constraints
α(t) = exp(λ_1 t/τ_w). Oja makes w(t) → e_1/√α.
Saturation can disturb the outcome.
[Figure: A, B: trajectories of (w_1, w_2) under different normalisation constraints]
Subtractive constraint: τ_w dw/dt = Q·w − (w·Q·n) n / N_u
Sometimes e_1 ∝ n, so its growth is stunted; and e_µ·n = 0 for µ ≠ 1, so
w(t) = (w(0)·e_1) e_1 + Σ_{µ=2}^{N_u} exp(λ_µ t/τ_w) (w(0)·e_µ) e_µ
16 Translation Invariance
A particularly important case for development has ⟨u_b⟩ = ⟨u⟩ and Q_{bb'} = Q(b − b').
Write n = (1, ..., 1) and J = n nᵀ; then Q splits into a DC part (proportional to J) and the remainder.
1. e_µ·n = 0: AC modes are unaffected
2. e_µ·n ≠ 0: DC modes are affected
3. Q has discrete sines and cosines as eigenvectors
4. the Fourier spectrum of Q gives the eigenvalues
17 PCA
What is the significance of e_1?
[Figure: A, B, C: input data clouds and the weight vector w in the (u_1, u_2) plane]
optimal linear reconstruction: minimise E(w, g) = ⟨|u − g v|²⟩
information maximisation: I[v, u] = H[v] − H[v|u] under a linear model
assume ⟨u⟩ = 0, or use C instead of Q
18 Linear Reconstruction
E(w, g) = ⟨|u − g v|²⟩ = K − 2 w·Q·g + |g|² w·Q·w
quadratic in w, with minimum at w = g/|g|², making E(w, g) = K − g·Q·g / |g|².
Look for a solution with g = Σ_k (e_k·g) e_k and |g|² = 1:
E(w, g) = K − Σ_{k=1}^{N} (e_k·g)² λ_k
The minimum clearly has e_1·g = 1 and e_2·g = e_3·g = ... = 0.
Therefore g and w both point along the principal component.
19 Infomax (Linsker)
argmax_w I[v, u] = H[v] − H[v|u]
Very general unsupervised learning suggestion. H[v|u] is not quite well defined unless v = w·u + η for some noise η (otherwise v is arbitrarily deterministic given u).
H[v] = ½ log(2πeσ²) for a Gaussian.
If P[u] ~ N[0, Q] then v ~ N[0, w·Q·w + υ²]
maximise w·Q·w subject to |w|² = 1 (note the normalisation)
Same problem as above: implies that w ∝ e_1.
If non-Gaussian, this only maximises an upper bound on I[v, u].
20 Ocular Dominance
[Figure: retina-thalamus-cortex wiring: a cortical unit v receives weights w^L and w^R from left and right thalamic inputs u^L and u^R, with competitive interaction; ocular-dominance pattern alternating L and R]
OD develops around eye-opening
interaction with refinement of topography
interaction with orientation
interaction with ipsi/contra-innervation
effect of manipulations to input
21 Start Simple
Consider one input from each eye. Then v = w_R u_R + w_L u_L and
Q = ⟨u uᵀ⟩ = [[q_S, q_D], [q_D, q_S]]
e_1 = (1, 1)/√2, e_2 = (1, −1)/√2, λ_1 = q_S + q_D, λ_2 = q_S − q_D
so if w_+ = w_R + w_L and w_− = w_R − w_L, then
τ_w dw_+/dt = (q_S + q_D) w_+ and τ_w dw_−/dt = (q_S − q_D) w_−.
Since q_D ≥ 0, w_+ dominates, so use subtractive normalisation:
τ_w dw_+/dt = 0, τ_w dw_−/dt = (q_S − q_D) w_−
so w_− grows until it saturates at one of its limits and one eye dominates.
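A sketch of this two-input model with subtractive normalisation and saturating weight limits; q_S, q_D, the limits and the small initial asymmetry are illustrative assumptions.

```python
import numpy as np

# Two-input ocular dominance model: w_+ is clamped by subtractive
# normalisation while w_- grows, so one eye's weight saturates at the upper
# limit and the other falls to zero.  q_S, q_D and the limits are illustrative.
q_S, q_D = 1.0, 0.4          # same-eye and between-eye correlations, q_D >= 0
dt_over_tau = 0.01
w = np.array([0.52, 0.48])   # [w_R, w_L], slightly asymmetric start
Q = np.array([[q_S, q_D], [q_D, q_S]])
n = np.ones(2)

for _ in range(5000):
    dw = Q @ w - (w @ Q @ n) * n / 2.0      # Hebb plus subtractive normalisation
    w = np.clip(w + dt_over_tau * dw, 0.0, 1.0)

print("w_R, w_L:", np.round(w, 3))          # one eye dominates
print("w_+ = w_R + w_L:", round(w.sum(), 3))
```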
22 Orientation Selectivity
The model is exactly the same; the input correlations come from ON/OFF cells.
[Figure: C, D: the correlation function Q(b) and its Fourier transform]
Now the dominant mode of Q has spatial structure.
A centre-surround version is also possible, but is usually dominated because of non-linear effects.
23 Temporal Hebbian Rules
Look at the rate-based temporal model as
Δw = (1/τ_w) ∫_0^T dt v(t) ∫ dτ H(τ) u(t − τ)
ignoring some edge effects.
Correlate the output v(t) with a filtered version of the input, ∫ dτ H(τ) u(t − τ),
i.e. look for structure at the scale of the temporal filter.
24 Multiple Output Neurons
[Figure: output units v with fixed recurrent connections M and feedforward weights W from inputs u_1, u_2, u_3, ..., u_{N_u}]
Fixed recurrent connections lead to
τ_r dv/dt = −v + W·u + M·v, so v = W·u + M·v = K·W·u
where K = (I − M)^{-1}.
Thus with Hebbian learning
τ_w dW/dt = ⟨v uᵀ⟩ = K·W·Q
and we can analyse the eigeneffect of K.
25 Ocular Dominance Revisited
[Figure: A, B: left and right inputs u^L, u^R projecting to an array of cortical units]
Write w_+ = w_R + w_L and w_− = w_R − w_L for the projective weights; then
τ_w dw_+/dt = (q_S + q_D) K·w_+ and τ_w dw_−/dt = (q_S − q_D) K·w_−
Since w_+ is clamped by subtractive normalisation, we are just interested in the pattern of ± in w_−.
Since K is Toeplitz, its eigenvectors are waves; the eigenvalues come from the Fourier transform.
[Figure: A: K and the leading eigenvector versus cortical distance (mm); B: the Fourier transform of K versus k (1/mm)]
26 Competitive Hebbian Learning
Use a competitive non-linearity
z_a = (Σ_b W_ab u_b)^δ / Σ_{a'} (Σ_b W_{a'b} u_b)^δ
in conjunction with a positive interaction term v_a = Σ_{a'} M_{aa'} z_{a'}
and standard Hebbian learning.
[Figure: A, B, C: weights W as a function of output a and left/right input b, showing ocularity and topography]
27 Feature-Based Models
Reduced descriptions: (x, y, z, r cos(θ), r sin(θ))
x, y: topographic location; z: ocularity (∈ [−1, 1]); r: orientation strength; θ: orientation
matching: replace [W·u]_a by exp(−Σ_b (u_b − W_ab)²/2σ_b²), plus softmax competition and cortical interaction
learning: self-organizing map
τ_w dW_ab/dt = v_a (u_b − W_ab)
or elastic net (competition only):
τ_w dW_ab/dt = v_a (u_b − W_ab) + β Σ_{a' ∈ N(a)} (W_{a'b} − W_ab)
28 Large-Scale Results
Meshing of the patterns of OD and OR: pinwheels; linear zones; ocular dominance boundaries
Overall pattern of OD stripes versus an elastic net simulation
29 Redundancy
Multiple units bring redundancy: under Hebbian learning all units become the same, and fixed output connections are inadequate.
One possibility is decorrelation: ⟨v vᵀ⟩ = I. If Gaussian, then complete factorisation.
Approaches:
Atick & Redlich: force an n → n mapping and decorrelate using anti-Hebbian learning
Földiák: use Hebbian and anti-Hebbian learning to learn feedforward and lateral weights
Sanger: explicitly subtract off the first component from subsequent ones
Williams: subtract off the predicted portion of u
30 Goodall
v = W·u + M·v
Anti-Hebbian learning is ideal for lateral weights: if v_a and v_b are correlated, make M_ab = M_ba negative, which reduces the correlation.
Goodall: n → n with W = I, so v = (I − M)^{-1}·u = K·u.
Then τ_M dM/dt = −⟨u vᵀ⟩ + I − M.
At dM/dt = 0: I − M = ⟨u vᵀ⟩, i.e. K^{-1} = Q·Kᵀ, so K·Q·Kᵀ = I.
So ⟨v vᵀ⟩ = K·⟨u uᵀ⟩·Kᵀ = I as required.
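A sketch of this decorrelation scheme run online; the input statistics, learning rate and network size are illustrative assumptions, and the check at the end verifies K·Q·Kᵀ ≈ I.

```python
import numpy as np

# Goodall decorrelation: with W = I, v = u + M v = K u where K = (I - M)^{-1},
# and the lateral weights follow tau_M dM/dt = -<u v^T> + I - M.
# Input statistics and learning rate are illustrative.
rng = np.random.default_rng(5)
n = 3
A = rng.normal(size=(n, n))
Q = A @ A.T / n + np.eye(n)          # correlated input covariance
L = np.linalg.cholesky(Q)

M = np.zeros((n, n))
lr = 0.02
for _ in range(20000):
    u = L @ rng.normal(size=n)
    K = np.linalg.inv(np.eye(n) - M)
    v = K @ u
    M += lr * (-np.outer(u, v) + np.eye(n) - M)   # anti-Hebbian lateral update

K = np.linalg.inv(np.eye(n) - M)
print("K Q K^T (should be close to the identity):")
print(np.round(K @ Q @ K.T, 2))
```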
31 Temporal Plasticity
Using the temporal rule:
τ_w dw/dt = ∫_0^∞ dτ (H(τ) v(t) u(t−τ) + H(−τ) v(t−τ) u(t))
[Figure: A: output v versus position s; B: place field location (cm) versus lap number]
The cell at s_a = 2 is active before the cell at s_a = 0, so the synapse 2 → 0 gets strengthened and the cell at s_a = 0 extends its firing field backwards.
32 Supervised Learning
Consider the case of learning pairs u^m, v^m:
classification: binary v^m to classify real-valued u^m
regression: real-valued mapping from u^m to v^m
storage: learn the relationships in the data
generalisation: infer a functional relationship from limited examples
error-correction: mistakes drive adaptation
Hebbian plasticity: τ_w dw/dt = ⟨v u⟩ = (1/N_S) Σ_{m=1}^{N_S} v^m u^m
with (multiplicative) weight decay: τ_w dw/dt = ⟨v u⟩ − α w, which makes w → ⟨v u⟩/α. No positive feedback.
33 Classification and the Perceptron
Classification rule:
v = 1 if w·u − γ ≥ 0; v = 0 if w·u − γ < 0
Can use supervised Hebbian learning, w = (2/N_S) Σ_{m=1}^{N_S} v^m u^m,
but this works quite poorly for random patterns.
[Figure: A: geometry of the weight vector w and the inputs; B: probability correct as a function of the number of patterns]
34 Function Approximation
[Figure: basis function network: stimulus s drives basis activities u = f(s), which are weighted by w to give the output v]
output: v(s) = w·u = w·f(s)
error: E = ½ ⟨(h(s) − w·f(s))²⟩
reaches a minimum at (the normal equations): ⟨f(s) f(s)ᵀ⟩·w = ⟨f(s) h(s)⟩
35 Hebbian Function Approximation
When does the Hebbian w = ⟨f(s) h(s)⟩/α satisfy the normal equations ⟨f(s) f(s)ᵀ⟩·w = ⟨f(s) h(s)⟩?
1. input patterns are orthogonal: ⟨f(s) f(s)ᵀ⟩ = I
2. tight frame condition: f(s^m)·f(s^{m'}) = c δ_{mm'}, as then
⟨f(s) f(s)ᵀ⟩·w = (1/(α N_S²)) Σ_{mm'} f(s^m) (f(s^m)·f(s^{m'})) h(s^{m'})
             = (c/(α N_S²)) Σ_m f(s^m) h(s^m)
             = (c/(α N_S)) ⟨f(s) h(s)⟩
V1 forms an approximate tight frame.
36 Error-Correcting Rules
Hebbian plasticity is independent of the performance of the network.
Perceptron learning rule: if v(u^m) = 0 when v^m = 1, modify w and γ to increase w·u^m − γ.
Easiest rule:
w → w + ε_w (v^m − v(u^m)) u^m
γ → γ − ε_w (v^m − v(u^m))
which implies Δ(w·u^m − γ) = ε_w (v^m − v(u^m)) (|u^m|² + 1), which has just the right sign.
In fact, guaranteed to converge (when the patterns are separable).
Note the discrete nature of the weight update.
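A sketch of this error-correcting loop on a made-up linearly separable problem; the data, learning rate and epoch limit are illustrative assumptions.

```python
import numpy as np

# Perceptron error-correcting rule on a toy linearly separable problem.
rng = np.random.default_rng(6)
N_u, N_S = 10, 40
w_true = rng.normal(size=N_u)
U = rng.normal(size=(N_S, N_u))
V = (U @ w_true >= 0).astype(float)        # separable targets in {0, 1}

w = np.zeros(N_u)
gamma = 0.0
eps = 0.1
for epoch in range(100):
    errors = 0
    for u, v_target in zip(U, V):
        v_out = float(w @ u - gamma >= 0)
        if v_out != v_target:
            w += eps * (v_target - v_out) * u      # discrete weight update
            gamma -= eps * (v_target - v_out)
            errors += 1
    if errors == 0:                                # all patterns correct
        print("converged after", epoch + 1, "epochs")
        break
print("training accuracy:", np.mean((U @ w - gamma >= 0) == V.astype(bool)))
```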
37 The Delta Rule
Definition of the task in E(w): how well (or poorly) do synaptic weights w perform?
Gradient descent: w → w − ε_w ∇_w E(w)
since if w' = w − ε_w ∇_w E(w), then to first order in ε_w:
E(w − ε_w ∇_w E) = E(w) − ε_w |∇_w E|² ≤ E(w)
[Figure: E(w) as a function of w, with gradient descent steps towards the minimum]
38 Stochastic Gradient Descent
E(w) = ½ ⟨(h(s) − w·f(s))²⟩ is an average over many examples.
Use random input-output pairs s^m, h(s^m) and change
w → w − ε_w ∇_w (h(s^m) − v(s^m))²/2 = w + ε_w (h(s^m) − v(s^m)) f(s^m)
called stochastic gradient descent.
[Figure: A, B, C: the fitted function v(s) versus s at successive stages of learning]
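A sketch of stochastic gradient descent on a basis-function network; the Gaussian basis functions, the target h(s) = sin(s) and the learning rate are all illustrative assumptions.

```python
import numpy as np

# SGD on a basis-function network v(s) = w . f(s) with Gaussian basis
# functions; target, centres and learning rate are illustrative.
rng = np.random.default_rng(7)
centres = np.linspace(-np.pi, np.pi, 15)
sigma = 0.5

def f(s):
    return np.exp(-(s - centres) ** 2 / (2 * sigma ** 2))   # basis activities

def h(s):
    return np.sin(s)                                          # target function

w = np.zeros_like(centres)
eps = 0.05
for _ in range(20000):
    s = rng.uniform(-np.pi, np.pi)          # random example s^m
    v = w @ f(s)
    w += eps * (h(s) - v) * f(s)            # delta-rule / SGD update

test = np.linspace(-np.pi, np.pi, 200)
err = np.mean([(h(s) - w @ f(s)) ** 2 for s in test])
print("mean squared error on a test grid:", round(err, 5))
```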
39 Contrastive Hebbian Learning
The delta rule involves: w → w + ε_w (v^m u^m − v(u^m) u^m)
Hebbian learning: v^m u^m, based on the target
anti-Hebbian learning: −v(u^m) u^m, based on the outcome
learning stops when outcome = target
Generalize to a stochastic network:
P[v|u; W] = exp(−E(u, v)) / Z(u), with Z(u) = Σ_v exp(−E(u, v))
The weights W generate a conditional distribution, e.g. with the quadratic form E(u, v) = −v·W·u.
40 Goal of Learning
Natural quality measure for each u:
D_KL(P[v|u], P[v|u; W]) = Σ_v P[v|u] ln(P[v|u] / P[v|u; W])
                        = −Σ_v P[v|u] ln(P[v|u; W]) + K
Average over the u^m, with v^m a sample of P[v|u^m]:
⟨D_KL(P[v|u], P[v|u; W])⟩ ≈ −(1/N_S) Σ_{m=1}^{N_S} ln(P[v^m|u^m; W]) + K
which amounts to maximum likelihood learning.
∂ ln P[v^m|u^m; W]/∂W_ab = −∂(E(u^m, v^m) + ln Z(u^m))/∂W_ab = v_a^m u_b^m − Σ_v P[v|u^m; W] v_a u_b^m
which is again Hebb (positive) and anti-Hebb (negative).
Use Gibbs sampling for v ~ P[v|u^m; W]. The unsupervised version is just the same.
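A sketch of this maximum-likelihood (Hebb minus anti-Hebb) gradient for a tiny conditional model with binary v; because the network is small, the model average over v is computed by exact enumeration rather than Gibbs sampling. The sizes, data and learning rate are illustrative assumptions.

```python
import numpy as np
from itertools import product

# Maximum-likelihood learning for P[v|u; W] proportional to exp(v.W.u) over
# binary v; enumeration replaces Gibbs sampling for this toy size.
rng = np.random.default_rng(8)
n_v, n_u = 3, 4
V_STATES = np.array(list(product([0, 1], repeat=n_v)), dtype=float)

def cond_probs(W, u):
    logits = V_STATES @ W @ u                 # -E(u, v) = v.W.u for each v
    p = np.exp(logits - logits.max())
    return p / p.sum()

# training pairs (u^m, v^m) sampled from a made-up "true" model
W_true = rng.normal(size=(n_v, n_u))
U = (rng.random((200, n_u)) < 0.5).astype(float)
V = np.array([V_STATES[rng.choice(len(V_STATES), p=cond_probs(W_true, u))]
              for u in U])

W = np.zeros((n_v, n_u))
eps = 0.02
for _ in range(200):
    for u, v in zip(U, V):
        v_model = cond_probs(W, u) @ V_STATES               # <v> under the model
        W += eps * (np.outer(v, u) - np.outer(v_model, u))  # Hebb minus anti-Hebb

avg_ll = np.mean([np.log(cond_probs(W, u)[np.argmax((V_STATES == v).all(axis=1))])
                  for u, v in zip(U, V)])
print("average log-likelihood of the training pairs:", round(avg_ll, 3))
```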
41 Function Approximation
Two ignored concerns:
1. examples may be corrupted by noise: D = {(s^m, v^m)} with v^m = h(s^m) + η^m
2. we want v(s) = h(s) even away from the training set: in general impossible; need prior information, e.g. smoothness of h(s)
A Gaussian process says that given s^1, s^2, ..., s^K, the values h(s^1), h(s^2), ..., h(s^K) are jointly Gaussian with 0 mean (short of expectations about bias), with a covariance matrix determined by prior expectations, e.g. Σ_µν = Σ(|s^µ − s^ν|).
Then the distribution of h(s) given D is Gaussian, with a mean that depends linearly on v^1, ..., v^{N_S}.
42 Functional Priors
Alternatively, specify P[h] ∝ e^{−½ φ[h]}, where φ[h] is a functional, e.g.
φ[h] = ∫ ds̃ |h̃(s̃)|² / G̃(s̃)
in the Fourier domain. Dividing by G̃(s̃) = e^{−|s̃|²/δ} penalises high-frequency wobbles in h(s).
If the η^µ are iid Gaussians, then the likelihood is
P[D|h] ∝ exp(−(1/2σ²) Σ_{m=1}^{N_S} (v^m − h(s^m))²)
So the posterior distribution over h(s) is
log P[h|D] = log P[D|h] P[h] + K = −(1/2σ²) Σ_{m=1}^{N_S} (v^m − h(s^m))² − ½ φ[h] + K
43 Regularization Theory
Find the MAP solution by minimising
E[h] = (1/σ²) Σ_{m=1}^{N_S} (v^m − h(s^m))² + φ[h]
Turns out that
h(s) = Σ_{m=1}^{N_S} w_m G(s − s^m) + k(s)
where G(s − s^µ) is the inverse Fourier transform of the smoothness prior and k(s) satisfies φ[k] = 0.
The weights are given by w = (G + σ² I)^{-1} v, where G_mn = G(s^m − s^n), recovering the linear dependence on v.
For Gaussian G̃(s̃), G(s − s^m) ∝ e^{−δ|s − s^m|²} is a radial basis function.
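A sketch of this solution on made-up one-dimensional data: a Gaussian radial basis function, weights w = (G + σ²I)^{-1} v, and the MAP estimate h(s) = Σ_m w_m G(s − s^m). The target function, noise level and kernel width are illustrative assumptions.

```python
import numpy as np

# Regularization-theory / GP regression with a Gaussian radial basis function.
rng = np.random.default_rng(9)

def G(d, delta=2.0):
    return np.exp(-delta * d ** 2)          # Gaussian RBF from the smoothness prior

# noisy training data v^m = h(s^m) + eta^m, with an illustrative target
s_train = np.linspace(-2.0, 2.0, 12)
sigma = 0.1
v_train = np.sin(2 * s_train) + sigma * rng.normal(size=s_train.size)

Gram = G(s_train[:, None] - s_train[None, :])            # G_mn = G(s^m - s^n)
w = np.linalg.solve(Gram + sigma ** 2 * np.eye(len(s_train)), v_train)

def h_map(s):
    """MAP estimate h(s) = sum_m w_m G(s - s^m)."""
    return G(s - s_train) @ w

print("s      h_MAP(s)   sin(2s)")
for s in np.linspace(-2.0, 2.0, 9):
    print(f"{s:+.2f}   {h_map(s):+.3f}   {np.sin(2 * s):+.3f}")
```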
44 Regularization Theory
[Figure: RBF network: the output v(s) = h(s) is built from basis functions G(s − s^m) with weights w_1, ..., w_N]
one basis function for each example; the basis functions are determined by the prior
as σ² → 0, the MAP solution interpolates the v^µ
use Bayesian hyperparameters to manipulate the norm
fixed basis functions, inference over the weights:
E(w) = (1/σ²) Σ_{m=1}^{N_S} (v^m − h(s^m))² + φ(w), e.g. φ(w) = |w|² is Gaussian
or learn the basis functions themselves using backprop.