A quick introduction to reservoir computing

1 A quick introduction to reservoir computing Herbert Jaeger Jacobs University Bremen

2 1 Recurrent neural networks

3 Feedforward and recurrent ANNs
[Figure: A. a feedforward network; B. a recurrent network mapping an input time series to an output time series]
Characteristics of recurrent ANNs:
- have at least one feedback loop of connections, thus can maintain dynamic activation even without input
- can approximate any dynamical system (universal approximation property, Funahashi & Nakamura 1993)
- mathematical analysis is difficult
- learning algorithms are computationally expensive and difficult to master
- few application-oriented publications, little research

4 Hopfield networks, Boltzmann machines, Deep Belief Networks, ART networks, etc.
These are also recurrent. Differences:
- symmetric connections
- energy-based models with equilibrium / stochastic sampling dynamics
- used for pattern recognition and associative memory tasks, not for time series processing (only if forced to...)

5 Problems with "classical" gradient-based training methods for RNNs
Besides the generic difficulties of gradient-descent methods, RNNs add a few of their own:
- training may pass through bifurcations
- error gradient information shrinks exponentially over time
- long memory effects are difficult to train ("long short-term memory" (LSTM) RNNs are an exception, Gers et al. 2000)
- algorithms and maths are difficult; experienced professionals needed
But: powerful in the hands of "deep learning" experts who have learnt how to tame them.

6 2 Echo State Networks Basics

7 Echo state networks
Developed independently of and simultaneously with Liquid State Machines (LSM) (Wolfgang Maass, TU Graz).
Basic idea of ESNs and LSMs:
- use a large, random, fixed RNN (the "dynamical reservoir", which is not changed by training!)
- train only the connections from the reservoir to the output units
Traditional training (e.g. BPTT): all connections are trained. ESN training: only the output weights are trained.
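
Below is a minimal sketch, not taken from the slides, of how such a fixed random reservoir is commonly set up: a sparse random weight matrix rescaled to a chosen spectral radius, plus fixed random input weights. The sizes, the 10 % connectivity and the spectral radius 0.9 are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def init_reservoir(n_res=100, n_in=1, density=0.1, spectral_radius=0.9):
    # sparse random recurrent weights, kept fixed during training
    W = rng.uniform(-1, 1, size=(n_res, n_res))
    W *= rng.random((n_res, n_res)) < density
    # rescale so the largest eigenvalue magnitude equals the desired spectral radius
    W *= spectral_radius / max(abs(np.linalg.eigvals(W)))
    # fixed random input weights
    W_in = rng.uniform(-1, 1, size=(n_res, n_in))
    return W, W_in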

8 Illustrating the principle: training a tone generator
Goal: train a network to work as a tuneable tone generator
- input: frequency setting
- output: sine wave of the desired frequency

9 Phase 1: internal state collection
Drive the fixed "reservoir" network with the teacher input and the teacher output, and save the time signals of the reservoir units.
Observation: the internal states of the dynamical reservoir reflect and modify both the input and the output teacher signals.
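
A hedged sketch of this state-collection phase (assumptions: tanh units, teacher forcing of the output through feedback weights W_fb, and an initial washout that is discarded):

import numpy as np

def collect_states(W, W_in, W_fb, u_teach, y_teach, washout=100):
    x = np.zeros(W.shape[0])
    y_prev = np.zeros_like(y_teach[0])
    states = []
    for u, y in zip(u_teach, y_teach):
        # reservoir update, driven by the teacher input and the teacher-forced previous output
        x = np.tanh(W @ x + W_in @ u + W_fb @ y_prev)
        states.append(x.copy())
        y_prev = y
    # discard the initial transient before the regression in Phase 2
    return np.array(states)[washout:]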

10 Phase 2 (and finish): compute output weights
Determine the reservoir-to-output weights such that the training output is optimally reconstituted from the internal "echo" signals in a least mean square error sense. This is a linear regression of the teacher output signal (red) on the reservoir signals.
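
A corresponding sketch of the regression step; the ridge (Tikhonov) regularization term beta is my addition, plain least squares corresponds to beta = 0. The same washout as in Phase 1 is assumed.

import numpy as np

def train_readout(states, y_teach, washout=100, beta=1e-6):
    X = states                   # reservoir states from Phase 1, shape (T - washout, n_res)
    Y = y_teach[washout:]        # teacher outputs aligned with the retained states
    # solve (X^T X + beta I) W_out^T = X^T Y for the output weights
    W_out = np.linalg.solve(X.T @ X + beta * np.eye(X.shape[1]), X.T @ Y).T
    return W_out                 # shape (n_out, n_res); in exploitation y(n) = W_out x(n)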

11 Tone generator: exploitation
With the new output weights in place, drive the trained network with input.
Observation: the network continues to function as in training:
- internal states reflect input and output
- the output is reconstituted from the internal states
- internal states and output constitute each other

12 Dynamical reservoir
A large recurrent network works as the "dynamical reservoir"; the output units combine the different internal dynamics into the desired dynamics.
[Figure: input unit feeding a recurrent "dynamical reservoir" connected to the output units]

13 3 Basic Examples
- predicting chaotic time series
- event detection for robots
- speaker identification ("Japanese Vowels" benchmark)
- financial time series prediction
- learning a fairytale (rather, its text statistics)
- nonlinear control (pendulum, robot drive)

14 The Mackey-Glass attractor
\dot{x}(t) = 0.2\, x(t-\tau) / (1 + x(t-\tau)^{10}) - 0.1\, x(t)
- a delay differential equation
- for delay τ > 16.8: chaotic
- widely used benchmark for time series prediction
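
A hedged sketch of how such a Mackey-Glass series can be generated numerically; the Euler step size, the constant initial history and the one-sample-per-time-unit subsampling are illustrative assumptions, not the exact settings behind the following slides.

import numpy as np

def mackey_glass(n_samples=3000, tau=17, dt=0.1, subsample=10, x0=1.2):
    hist_len = int(round(tau / dt))        # the delay expressed in Euler steps
    x_hist = np.full(hist_len, x0)         # ring buffer holding x(t - tau) ... x(t - dt)
    x = x0
    out = []
    for step in range(n_samples * subsample):
        x_tau = x_hist[step % hist_len]    # delayed value x(t - tau)
        x_hist[step % hist_len] = x        # store the current x for use tau time units later
        x = x + dt * (0.2 * x_tau / (1.0 + x_tau ** 10) - 0.1 * x)
        if step % subsample == subsample - 1:
            out.append(x)                  # keep one sample per time unit (sampling rate 1)
    return np.array(out)

series = mackey_glass()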

15 Learning setup
- network size: 1000
- training sequence: N = 3000
- sampling rate: 1

16 Prediction with the model
Visible discrepancy after about 1500 steps.

17 Comparison: log10(NRMSE) for 84-step prediction (Jaeger & Haas 2004)
ESN (refined): -5.1
ESN (1+2 K): -4.2
PCR Local Model (McNames 99, 2K): -1.7
SOM (Vesanto 97, 3K): -1.7
DCS-LLM (Chudy & Farkas 98, 3K): -1.7
AMB (Bersini et al. 98, ?K)*: -1.3
Neural Gas (Martinez et al. 93, ~4K): -1.3
EPNet (Yao & Liu 97, 0.5K): -1.2
BPNN (Lapedes & Farber 87, ?K)*: -1.2
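
For reference, a small sketch of how the reported quantity could be computed; 'predictions' and 'targets' are hypothetical arrays of 84-step-ahead predictions and true values, and normalizing by the target standard deviation is one common NRMSE convention, assumed here.

import numpy as np

def nrmse(predictions, targets):
    # root mean square error, normalized by the standard deviation of the target signal
    return np.sqrt(np.mean((predictions - targets) ** 2)) / np.std(targets)

# log10(nrmse(pred_84, true_84)) is then comparable to the table entries above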

18 Event detection for robots (joint work with J. Hertzberg & F. Schönherr)
A robot drives through an office environment and experiences data streams (27 channels) like:
- infrared distance sensor
- left motor speed
- activation of "gothrudoor"
- an external teacher signal marking the event category
[Figure: 10 sec excerpt of these channels]

19 Learning setup
- 27 (raw) data channels as input
- 100 unit RNN
- an unlimited number of event detector channels as output
- simulated robot (rich simulation); training run spans 15 simulated minutes
- event categories like: pass through door, pass by 90° corner, pass by smooth corner

20 Results
- easy to train event hypothesis signals
- "boolean" categories possible
- single-shot learning possible

21 Japanese vowels: task
- benchmark data set from the UCI KDD repository
- 9 male Japanese speakers, samples of the two-vowel utterance /ae/, represented by 12 LPC cepstrum coefficients
- task: discriminate the speakers (learn on 9 x 30 samples, test on 370 samples in total, unevenly distributed)
[Figure: exemplary samples from the 9 speakers]

22 Japanese vowels: training setup
Basic idea:
- input: 12 speech signal coefficients
- 100 unit augmented ESN
- output: 9 speaker indicators
[Figure: in training, speaker no. 2 speaks and indicator no. 2 is taught; in testing, the strongest indicator wins (here no. 9)]
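
A hedged sketch, not the authors' code, of how such a setup can classify an utterance: drive the reservoir with the 12 coefficients, let a linear readout on the (input-augmented) state produce the 9 indicator signals, accumulate them over the utterance and pick the strongest. The augmented readout and the averaging rule are assumptions.

import numpy as np

def classify_utterance(utterance, W, W_in, W_out):
    """utterance: array of shape (T, 12) with LPC cepstrum coefficients."""
    x = np.zeros(W.shape[0])                     # reservoir state (e.g. 100 units)
    votes = np.zeros(W_out.shape[0])             # accumulated 9 speaker indicators
    for u in utterance:
        x = np.tanh(W @ x + W_in @ u)            # drive the fixed reservoir
        votes += W_out @ np.concatenate([x, u])  # linear readout on state + input
    return int(np.argmax(votes))                 # speaker with the strongest mean indicator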

23 Japanese vowels: results (discrimination learning)
- original result (Kudo et al. 1999) [5-unit continuous HMM]: 96.2 % correct (14 errors)
- previous best result (Strickert 2004) [self-organizing "merge neural gas", 1000 neurons]: 98.4 % correct (6 errors)
- ESN (Jaeger, Lukosevicius and Popovici 2007, refined architecture): zero test errors

24 Financial time series prediction
- NN3 financial time series forecasting competition, 2007
- 111 short financial time series of varied nature, origin withheld intentionally
- took part with the machine learning graduate seminar at Jacobs University (hjaeger/TimeSeriesCompetitionOverview.html), using ESNs... and we won!

25 Little Red Riding Hood
Training data: a 3412-symbol sequence; shown here: the first and the last 500 symbols.
once_upon_a_time_there_was_a_little_village_girl,_the_prettiest_ever_seen_her_mother_doted_upon_her,_and_so_did_her_grandmother._she,_good_woman,_made_for_her_a_little_red_hood_which_suited_her_so_well,_that_everyone_called_her_little_red_riding_hood._one_day_her_mother,_who_had_just_made_some_cakes,_said_to_her_my_dear,_you_shall_go_and_see_how_your_grandmother_is,_for_i_have_heard_she_is_ailing,_take_her_this_cake_and_this_little_pot_of_butter._little_red_riding_hood_started_off_at_once_for_he
[...]
oh,_grandmamma,_grandmamma,_what_great_arms_you_have_got_all_the_better_to_hug_you_with,_my_dear_oh,_grandmamma,_grandmamma,_what_great_legs_you_have_got_all_the_better_to_run_with,_my_dear_oh,_grandmamma,_grandmamma,_what_great_eyes_you_have_got_all_the_better_to_see_with,_my_dear_oh,_grandmamma,_grandmamma,_what_great_teeth_you_have_got_all_the_better_to_gobble_you_up_so_saying,_the_wicked_wolf_leaped_on_little_red_riding_hood_and_gobbled_her_up._here_endeth_the_tale_of_little_red_riding_hood.

26 Network setup in training
- 29 input channels code the symbols (_, a ... z)
- 400 unit reservoir
- 29 output channels for next-symbol hypotheses

27 Trained network in "text" generation
A decision mechanism (e.g. winner-take-all) picks the winning symbol from the output hypotheses; the winning symbol is fed back as the next input.

28 Results
Selection by random draw according to the output:
yth_upsghteyshhfakeofw_io,l_yodoinglle_d_upeiuttytyr_hsymua_doey_sammusos_trll,t.krpuflvek_hwiblhooslolyoe,_wtheble_ft_a_gimllveteud_...
Winner-take-all selection:
sdear_oh,_grandmamma,_who_will_go_and_the_wolf_said_the_wolf_said_the_wolf_said_the_wolf_said_the_wolf_said_the_wolf_said_the_wolf...

29 Results, continued
Selection by nonlinearly weighted random draw (namely, out^3.5):
d_wolf_said_the_better_to_her_the_wood_the_wolf_sheurter_that_of_butter_to_her_grandmother,_the_door_grandmamma,_who_was_bed,_she_the_better_the_wolf_sa_and_she_little_red_her_grandmatm_aa_grandmother_mother_grandmother_mother_good_wolf,_and_the_wolf_so_said_the_she_wolf,_and_i_have_gs_at_the_wolf,_wolf_butter_to_her_come_neard_the_bobbled_her_grandmamma,_grandmamma,_who_her_the_do_wolf_cake_her_grandmother_mother_to_her_to_she_me_the_better_to_her_the_bettle_red_riding_hood_see_her_the_pot_of_butter_...
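
For concreteness, a hedged sketch of the three selection mechanisms compared on the last two slides; treating the readout outputs as non-negative scores (clipping) is my assumption, the exponent 3.5 is taken from the slide.

import numpy as np

rng = np.random.default_rng()

def pick_symbol(outputs, mode="power", exponent=3.5):
    """outputs: length-29 vector of next-symbol hypotheses from the readout."""
    p = np.clip(outputs, 1e-12, None)          # assumed: treat readout outputs as non-negative scores
    if mode == "winner_take_all":
        return int(np.argmax(p))               # deterministic: strongest hypothesis wins
    if mode == "random_draw":
        return int(rng.choice(len(p), p=p / p.sum()))
    if mode == "power":                        # nonlinearly weighted random draw (out^3.5)
        w = p ** exponent
        return int(rng.choice(len(p), p=w / w.sum()))
    raise ValueError(mode)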

30 Direct / feedback tracking control
[Figure: training and exploitation block diagrams with delay elements z^{-d}, the pendulum plant, torque u and plant state y]
Training:
- the network "observes" the torque u(t-d) and the plant states y(t-d), y(t)
- the network learns how u(t) depends on y(t) and y(t+d)
Exploitation:
- network input: reference y_ref(t+d) and state feedback y(t)
- the network computes the control input u(t)

31 Simulated pendulum example
[Figure: training data (torque u, pendulum coordinate y) and exploitation (control signal u issued by the network, disturbance, actual vs. reference pendulum coordinate y, plus coordinate x in state feedback)]

32 Robot motor control: training and using the controller
[Figure: block diagram with delay elements z^{-2}, desired vs. simulated robot wheel speeds y, control input u, and a previously trained ESN model of the robot; original PWM robot data (left & right)]
- training data points: about 5 min of robot run
- 100 unit controller network with output feedback
- NMSE about ...
- data courtesy Paul Plöger et al.

33 Results
Original controller: a hand-designed PID (feedback) controller; the controlled trajectory lags behind the target.
Trained controller: run on a target generated by a reference model.
[Figures: target, controlled trajectory and control input for both controllers]

34 5.2 Stable short-term memory

35 Stable short-term memory
Motivation: reservoir memory is transient, but we also need stable dynamical STM.
- in single-word speech/writing recognition: good to remember a letter from the beginning of the word when reaching its end
- in phrase processing: good to remember initial words / grammatical categories

36 Example and architecture
Task: count opening and closing brackets in text.
Text: a random sequence of distorted {, } and 82 distractor symbols, presented as a 12-pixel-high grayscale image.
Idea: for each opening "{", flip one further output neuron from -0.5 to +0.5.
[Figure: grayscale input vector feeding a reservoir (fixed) with trained output neurons y1, y2, ... coding the number of open brackets]
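
A hedged sketch of one way the training targets for this counting task could be encoded, following the slide's idea that output neuron k is flipped from -0.5 to +0.5 while more than k brackets are open; the exact levels and the number of output neurons (6 bracket levels on the next slide) are assumptions.

def bracket_targets(text, n_levels=6):
    # one target vector per input symbol; entry k is +0.5 while more than k brackets are open
    targets = []
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth = max(depth - 1, 0)
        targets.append([0.5 if depth > k else -0.5 for k in range(n_levels)])
    return targets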

37 Performance
[Figure: red lines: output values, green lines: targets]
- 0.19 % errors/character with 6 bracket levels (Pascanu & Jaeger 2010)
- current work at Jacobs: mathematics of attractors in input-driven systems
(Work and figures by Razvan Pascanu)

38 References
Atiya, A. F., Parlos, A. G. (2000): New results on recurrent network training: Unifying the algorithms and accelerating convergence. IEEE Transactions on Neural Networks 11(3).
Funahashi, K.-I., Nakamura, Y. (1993): Approximation of dynamical systems by continuous time recurrent neural networks. Neural Networks 6.
Gers, F. A., Schmidhuber, J., Cummins, F. (2000): Learning to forget: continual prediction with LSTM. Neural Computation 12(10).
Hermans, M., Schrauwen, B. (2009): Memory in linear recurrent neural networks in continuous time. Neural Networks.
Jaeger, H. (2001): The "echo state" approach to analysing and training recurrent neural networks. GMD Report 148, German National Research Center for Information Technology.
Jaeger, H. (2001): Short term memory in echo state networks. GMD Report 152, German National Research Center for Information Technology.
Jaeger, H. (2006): Generating exponentially many periodic attractors with linearly growing Echo State Networks. IUB technical report Nr. 3.
Jaeger, H. (2007): Discovering multiscale dynamical features with hierarchical Echo State Networks. Technical Report 10, School of Engineering and Science, Jacobs University.
Jaeger, H. (2010): Reservoir Self-Control for Achieving Invariance Against Slow Input Distortions. Jacobs University technical report Nr. 23.
Jaeger, H., Haas, H. (2004): Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication. Science 304.
Jaeger, H., Lukosevicius, M., Popovici, D. (2007): Optimization and Applications of Echo State Networks with Leaky Integrator Neurons. Neural Networks 20(3).
Jaeger, H., Maass, W., Principe, J. (2007): Special issue on echo state networks and liquid state machines. Neural Networks 20(3).
Kudo, M., Toyama, J., Shimbo, M. (1999): Multidimensional Curve Classification Using Passing-Through Regions. Pattern Recognition Letters 20.
Lukosevicius, M., Jaeger, H. (2009, to appear): Reservoir Computing Approaches to Recurrent Neural Network Training. Computer Science Review.
Maass, W., Natschlaeger, T., Markram, H. (2002): Real-Time Computing Without Stable States: A New Framework for Neural Computation Based on Perturbations. Neural Computation 14(11).
Maass, W., Joshi, P., Sontag, E. (2007): Computational aspects of feedback in neural circuits. PLOS Computational Biology 3(1), 1-20.
Pascanu, R., Jaeger, H. (2010): A Neurodynamical Model for Working Memory. Neural Networks.
Strickert, M. (2004): Self-organizing neural networks for sequence processing. Ph.D. thesis, Univ. of Osnabrück, Dpt. of Computer Science.
White, O. L., Lee, D. D., Sompolinsky, H. S. (2004): Short-term Memory in Orthogonal Neural Networks. Phys. Rev. Lett. 92(14).
