Unicycling Helps Your French: Spontaneous Recovery of Associations by Learning Unrelated Tasks

Unicycling Helps Your French: Spontaneous Recovery of Associations by Learning Unrelated Tasks
Inman Harvey and James V. Stone
CSRP 379, May 1995
Cognitive Science Research Paper Serial No. CSRP 379
The University of Sussex, School of Cognitive and Computing Sciences, Falmer, Brighton BN1 9QH, England, UK
inmanh@cogs.susx.ac.uk, jims@cogs.susx.ac.uk
This version (revised October 1995) is that accepted by Neural Computation. In Press 1995.

Unicycling Helps Your French: Spontaneous Recovery of Associations by Learning Unrelated Tasks

Inman Harvey, School of Cognitive and Computing Sciences, University of Sussex, Brighton BN1 9QH, England, inmanh@cogs.susx.ac.uk
James V. Stone, School of Biological Sciences, University of Sussex, Brighton BN1 9QH, England, jims@cogs.susx.ac.uk

May 1995; revision October 1995

Abstract

We demonstrate that imperfect recall of a set of associations can usually be improved by training on a new, unrelated set of associations. This spontaneous recovery of associations is a consequence of the high dimensionality of weight spaces, and is therefore not peculiar to any single type of neural net. Accordingly, this work may have implications for spontaneous recovery of memory in the central nervous system.

1 Introduction

A spontaneous recovery effect in connectionist nets was first noted in (Hinton & Sejnowski, 1986), and analysed in (Hinton & Plaut, 1987). A net was first trained on a set of associations, and then its performance on this set was degraded by training on a new set. When retraining was then carried out on a proportion of the original set of associations, performance also improved on the remainder of that set.

In this paper a more general effect is demonstrated. A net is first trained on a set of associations, called task A, and then performance on this task is degraded, either by random perturbations of the connection weights, or as a result of learning a new task B. Performance on A is then monitored whilst the net is trained on another new task C. The main result of this paper is that in most cases performance on the original task A initially improves.

The following is a simplistic analogy, which assumes that this effect carries over to human learning of cognitive tasks. If you have a French examination tomorrow, but you have forgotten quite a lot of French, then a short spell of learning some new task, such as unicycling, can be expected to improve your performance in the French examination. Students of French should be warned not to take this fanciful analogy too literally; it requires the implausible assumption that French and unicycling make use of a common subset of neuronal connections.

We will first give an informal argument to explain the underlying geometrical reasons for this effect; follow this with an analysis of how it scales with the dimensionality of weight-space; and then demonstrate it with some examples.

2 High Dimensional Geometry

A number of assumptions will be used here; later we will evaluate their validity. Learning in connectionist models typically involves a succession of small changes to the connection weights between units. This can be interpreted as the movement of a point W in weight-space, the dimensionality of which is the number of weights. For the present, we assume that training on a particular task A moves W in a straight line towards a point A, where A represents the weights of a net which performs perfectly on task A; we also assume that distance from A is monotonically related to decrease in performance on task A.

Let A be the position of W after task A has been learned (see Figure 1). Assume that some 'forgetting' takes place, either through random weight changes, or through some training on a different task, which shifts W to a new point B. The point B lies on the surface of H, a hypersphere of radius r = |A − B| centred on A.

Figure 1: The circle is a 2-D representation of hypersphere H. Initial movement from a point B on the circumference towards C has two possible consequences: trajectory B1 → C is outside H, whereas B2 → C intersects H.

Figure 2: Graph of ratio R_n against dimension n. Both axes are logarithmically scaled. Data points are marked by circles on the lines, from θ = π/4 on the left to θ = 31π/64 on the right. (The curves correspond to θ = π/4, 3π/8, 7π/16, 15π/32 and 31π/64, i.e. d/r = 1.41, 2.61, 5.13, 10.2 and 20.4; n runs from 3 to 1000.)

We then initiate training on a task C which is unrelated to task A; under our assumptions, training moves W from B towards a point C, which is distance d = |A − C| from A. If the line connecting B to C passes through the volume of H then the distance |W − A| initially decreases as W moves towards C. In such cases, training on task C initially causes improvement in performance on task A.

We assume that point A has been chosen from a bounded set of points S, which may have any distribution; that H is centred on A; that B is chosen from a uniform distribution over the surface of H; and that C is chosen from S independently of the positions of A or B. What, then, is the probability that line segment BC passes through H? That is, what is the probability that training on task C generates spontaneous recovery on task A?

If C lies within H (i.e. if d < r) then recovery is guaranteed. For any point C outside H there is a probability p ≥ 0.5 of recovery on task A. Figure 1 demonstrates this for a two-dimensional space. The point B may lie anywhere on the circumference of H. The line segment BC only fails to pass through H if B lies on the smaller arc PQ, where CP and CQ are tangents to the circle, and hence cos(θ) = r/d. Thus p ≥ 0.5, and p → 0.5 as d → ∞.

Consider the extension to a third dimension, while retaining the same values r, d and θ. The probability q = (1 − p) that BC fails to pass through the sphere H is equal to the proportion of the surface of H which lies within a cone defined by PCQ with apex C. This proportion is considerably smaller in 3-D than it is in 2-D. We produce a formula for this proportion for n dimensions in the next section. We demonstrate analytically what can be seen intuitively, namely that for any given θ < π/2, as n increases q tends to zero.

3 Analysis

Let S(n, r, θ) be the surface 'area' of the segment of a hypersphere of radius r in n dimensions, subtended by a (hyper-)cone of half-angle θ; this segment is not a surface area, but rather a surface hypervolume of dimensionality (n − 1). The surface 'area' of the complete hypersphere is S(n, r, π). For some constant k_n, S(n, r, π) = k_n r^{n−1}. We can use this to calculate S(n, r, θ) by integration:

S(n, r, θ) = ∫_{φ=0}^{θ} S(n−1, r sin(φ), π) r dφ = k_{n−1} r^{n−1} ∫_{0}^{θ} sin^{n−2}(φ) dφ

We require the ratio R_n of S(n, r, θ) to S(n, r, π):

R_n = ∫_{0}^{θ} sin^{n−2}(φ) dφ / ∫_{0}^{π} sin^{n−2}(φ) dφ

This ratio R_n gives the probability that the line segment BC (in Figure 1, generalised here to n dimensions) fails to pass through the hypersphere, and is therefore equal to q.
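As a quick numerical check (not part of the original paper), the ratio R_n can be evaluated directly from the integrals above. A minimal sketch, assuming NumPy and SciPy are available; the choice of integration routine and the grid of n values printed below are illustrative, with the d/r values taken from Figure 2:

```python
# Illustrative evaluation of R_n = ∫_0^θ sin^(n-2)(φ) dφ / ∫_0^π sin^(n-2)(φ) dφ.
import numpy as np
from scipy.integrate import quad

def R(n, theta):
    """Probability q that the segment BC fails to pass through H in n dimensions."""
    f = lambda phi: np.sin(phi) ** (n - 2)
    num, _ = quad(f, 0.0, theta)
    # sin^(n-2) is symmetric about pi/2, so the full integral over [0, pi] is
    # twice the integral over [0, pi/2]; this keeps the sharp peak at pi/2
    # (important for large n) at an interval endpoint.
    den = 2.0 * quad(f, 0.0, np.pi / 2.0)[0]
    return num / den

for d_over_r in (1.41, 2.61, 5.13, 10.2, 20.4):    # the curves shown in Figure 2
    theta = np.arccos(1.0 / d_over_r)              # cos(theta) = r/d
    row = "  ".join(f"n={n}: {R(n, theta):.3g}" for n in (3, 10, 100, 1000))
    print(f"d/r = {d_over_r:5.2f}  {row}")
```

For example, for d/r = 1.41 (θ = π/4) the computed ratio is about 0.15 at n = 3 and falls below 10^-3 before n reaches 20, matching the leftmost curve of Figure 2 and illustrating how quickly q vanishes as the dimensionality grows.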

In Figure 2 we plot the ratio R_n against the dimensionality n, for values of θ from π/4 to 31π/64. These values of θ are associated with corresponding values of d/r (see Figure 2) between 1.41 and 20.4. For a given value of d/r, as the dimensionality n increases, the ratio R_n tends to zero. For large n, it is almost certain that the line segment BC passes through the hypersphere H. This implies that initially the point W moves from B closer to A. Hence performance improves, at least temporarily, on task A.

Returning to the assumptions stated earlier, we can now examine their validity. First, an irregular error surface ensures that training does not, in general, move W in a straight line. Second, if perturbation from A to B is achieved by training on a task B, then B is chosen from a distribution over H which may not be uniform. Third, perfect performance on task C may be associated not with one point C, but with many points which are equivalent in that they each provide a similar mapping from input to output. W may move towards the nearest of many Cs, which is therefore not chosen from S independently of A. This may alter the probability that W passes through H. Fourth, if B lies on a hypersphere of dimensionality m < n, then the probability that spontaneous recovery occurs may be reduced. Despite these considerations, evidence of the effects predicted above can be observed.

4 Experimental results

In two sets of experiments a net was initially trained using back-propagation on a task A. The net had 10 input, 100 hidden, and 10 output units. The hidden and output units had logistic activation functions; weights were initialised to random values in the range [−0.3, 0.3]. Task A was defined by 100 pairs of binary vectors which were chosen (without replacement) from the set of 2^10 vectors. The members of each pair were used to train the net using batch update for 1300 training epochs, with a learning rate of 0.02 and momentum of 0.9 [1]. After initial training on task A, the weights were perturbed from A by one of two different methods.

Experiment 1: Perturbing Weights by New Training

After training on A, the net was trained for 400 epochs on 5 new [2] vector pairs in order to perturb the weights away from A to a point B. Finally, the net was trained on a further 5 new vector pairs (task C) for 50 epochs. During training on C, the performance of the net on task A was monitored. As predicted by the analysis above, performance on task A usually showed a transient improvement as training on task C progressed. This procedure was repeated 380 times, using a single A, 20 B's and 20 C's [3]. Figure 3 shows how many runs improved or declined in performance on task A at the end of each epoch (together with a few runs which showed no change, given the precision used in calculations). A proportion 241/380 (63.4%) showed incidental relearning on A after the first epoch, but this dropped to less than 50% after the 5th epoch.

Experiment 2: Perturbing Weights Randomly

After training on task A as above, the weights of the net were perturbed by adding uniform noise. In Experiment 1, the distance |A − B| had a mean of about 7. In order to make perturbations of comparable magnitude, W was perturbed from A to B by adding a random vector of length 7. As described in Experiment 1, this was repeated 380 times.
The proportion of the runs which showed incidental relearning is given in Figure 3; this was 248/380 (65.3%) after the first epoch, and remained above 50% for the 50 epochs.

[1] This was designed to be comparable to the situation described in (Hinton & Plaut, 1987).
[2] Here "new" implies that none of these vectors exists in any previous training set.
[3] B and C were chosen without replacement from 20 sets of 5 vector pairs, giving 20 × 19 = 380 different possibilities.
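For readers who want to reproduce the effect, the following is a minimal sketch of Experiment 2 (random perturbation) using NumPy. The 10-100-10 architecture, logistic units, initial weight range [−0.3, 0.3], batch update, learning rate 0.02, momentum 0.9, 1300 epochs on task A, a perturbation of length 7, and 50 monitored epochs on task C follow the description above; everything else (the absence of bias units, the Gaussian direction of the perturbation, the particular random vector pairs) is an illustrative assumption rather than the authors' exact procedure.

```python
# Sketch of Experiment 2: train on task A, perturb the weights by a random
# vector of length 7, then watch the error on A while training on task C.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class Net:
    def __init__(self, n_in=10, n_hid=100, n_out=10):
        self.W1 = rng.uniform(-0.3, 0.3, (n_in, n_hid))
        self.W2 = rng.uniform(-0.3, 0.3, (n_hid, n_out))
        self.v1 = np.zeros_like(self.W1)   # momentum buffers
        self.v2 = np.zeros_like(self.W2)

    def forward(self, X):
        self.H = sigmoid(X @ self.W1)
        return sigmoid(self.H @ self.W2)

    def error(self, X, Y):
        return np.sum((self.forward(X) - Y) ** 2)

    def train(self, X, Y, epochs, lr=0.02, mom=0.9):
        for _ in range(epochs):                              # batch update with momentum
            out = self.forward(X)
            d_out = (out - Y) * out * (1 - out)              # output deltas (logistic units)
            d_hid = (d_out @ self.W2.T) * self.H * (1 - self.H)
            self.v2 = mom * self.v2 - lr * (self.H.T @ d_out)
            self.v1 = mom * self.v1 - lr * (X.T @ d_hid)
            self.W2 += self.v2
            self.W1 += self.v1

def random_pairs(n):
    """n input/target pairs of random 10-bit binary vectors (illustrative only;
    the paper drew pairs without replacement and kept the tasks disjoint)."""
    return (rng.integers(0, 2, (n, 10)).astype(float),
            rng.integers(0, 2, (n, 10)).astype(float))

XA, YA = random_pairs(100)        # task A: 100 vector pairs
XC, YC = random_pairs(5)          # task C: 5 unrelated pairs

net = Net()
net.train(XA, YA, epochs=1300)    # learn task A

# Perturb W by a random vector of length 7 (Experiment 2). For Experiment 1,
# replace this with 400 epochs of training on 5 further pairs (task B).
noise = rng.normal(size=net.W1.size + net.W2.size)
noise *= 7.0 / np.linalg.norm(noise)
net.W1 += noise[:net.W1.size].reshape(net.W1.shape)
net.W2 += noise[net.W1.size:].reshape(net.W2.shape)

for epoch in range(50):           # train on C while monitoring error on A
    net.train(XC, YC, epochs=1)
    print(epoch + 1, net.error(XA, YA))
```

If the effect appears, the summed squared error on task A dips during the early epochs of training on C before rising again, as in the right-hand panel of Figure 3; any single run may or may not show it, since the paper reports incidental relearning in roughly 65% of runs.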

Figure 3: Graphs of 380 runs, showing numbers improving in performance on A, during 50 epochs of training on C. On the left, Experiment 1, the weight vector W was perturbed from A to B by training on task B. On the right, Experiment 2, W was perturbed by a randomly chosen vector of length 7 from A to B. (Each panel plots, against epochs 0 to 50, the numbers out of 380 runs that are better at task A, worse at task A, or unchanged, together with a line marking 50 percent.)

5 Discussion

A new effect has been demonstrated, such that performance on some task A improves initially (from a degraded level of performance), when training is started on an unrelated task C. This has been analysed in terms of the geometrical properties of high-dimensional spaces. In applying this to weight-spaces, we rely on simplistic assumptions about the way training on a task relates to movement through weight-space. The effect can be observed even if these assumptions are violated, as demonstrated by experiment.

The graphs show evidence of spontaneous recovery. The effect can be seen to be short-lived in the first case, in which perturbation was achieved by retraining on B, and sustained in the second case, in which perturbation was random. Only in the latter case can we expect B to have been chosen from an unbiased distribution over the surface of H, unrelated to the position of C. The graphs indicate only the probabilities of improvement, without reference to the magnitudes of these effects in individual runs.

This recovery effect may be relevant to counter-intuitive phenomena described in (Parisi, Nolfi, & Cecconi, 1992; Nolfi, Elman, & Parisi, 1994), and may also contribute to the effect described in (Hinton & Sejnowski, 1986). It has been suggested [4] that the effect may be related to James-Stein shrinkage (Efron & Morris, 1977; James & Stein, 1961). That is, reducing the variance of (net) outputs reduces the squared error at the expense of introducing a bias. It may be that training on C incidentally induces shrinkage.

The observed effect is weaker than that predicted from the geometrical argument given above, presumably due to the simplistic nature of the assumptions used therein. However, the effect is robust inasmuch as it does not depend on the learning algorithm, nor on the type of net used. For this reason, the effect may have implications for spontaneous recovery of memory in the central nervous system.

Acknowledgments

Funding for the authors has been provided by the E.P.S.R.C. and the Joint Council Initiative. We thank the referees for useful comments.

[4] G. E. Hinton, personal communication.

References

Efron, B., & Morris, C. (1977). Stein's paradox in statistics. Scientific American, 236(5), 119-127.

Hinton, G., & Plaut, D. (1987). Using fast weights to deblur old memories. In Proceedings of the 7th Annual Conference of the Cognitive Science Society, Seattle.

Hinton, G., & Sejnowski, T. (1986). Learning and relearning in Boltzmann machines. In Rumelhart, D., McClelland, J., & the PDP Research Group (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, pp. 282-317. MIT Press/Bradford Books, Cambridge, MA.

James, W., & Stein, C. (1961). Estimation with quadratic loss. In Proc. of the Fourth Berkeley Symposium on Mathematical Statistics and Probability 1, pp. 361-380, Berkeley, California. University of California Press.

Nolfi, S., Elman, J., & Parisi, D. (1994). Learning and evolution in neural networks. Adaptive Behavior, 3(1), 5-28.

Parisi, D., Nolfi, S., & Cecconi, F. (1992). Learning, behavior and evolution. In Varela, F. J., & Bourgine, P. (Eds.), Toward a Practice of Autonomous Systems: Proceedings of the First European Conference on Artificial Life, pp. 207-216. MIT Press/Bradford Books, Cambridge, MA.