# Reinforcement Learning, Neural Networks and PI Control Applied to a Heating Coil


Figure 1: The simulated heating coil and its dual control system: a feedback controller and a neural network whose outputs are summed to form the control signal. Variables shown are the inlet and outlet air temperatures (T_ai, T_ao), the inlet and outlet water temperatures (T_wi, T_wo), the air mass flow rate f_a, the water mass flow rate f_w, and the temperature set point.

The inlet conditions follow bounded random walks to simulate disturbances and changing conditions that would occur in actual heating and air-conditioning systems. The bounds on the random walks were 4 ≤ T_ai ≤ 10 °C, 73 ≤ T_wi ≤ 81 °C, and 0.7 ≤ f_a ≤ 0.9 kg/s. A PI controller was tuned to control the simulated heating coil. The best proportional and integral gains were determined by minimizing the sum of the absolute differences between the set point and the actual exiting air temperature under a variety of disturbances.

## 3 Training with a Simple Inverse Model

The most immediate problem that arises when designing a procedure for training a neural-network controller is that a desired output for the network is usually not available. One way to obtain such an error is to transform the error in the controlled state variable, which is usually a difference from a set point, into a control-signal error, and use this error to train the network. An inverse model of the controlled system is needed for this transformation. In addition to an inverse model, one must consider which error in time is to be minimized. Ideally, a sum of the error over time is minimized; this is addressed in Section 5. Here the error n steps ahead is minimized.

A neural network was trained to reduce the error in the output air temperature n steps ahead. For the simulated heating coil, a step is five seconds: this is the assumed sampling interval for a digital controller implementing any of the control schemes tried here. The network was used as an independent controller, with the output of the network serving as the control signal.
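The PI loop and its tuning criterion can be sketched in a few lines. The gains, the toy first-order plant, and the signal ranges below are illustrative assumptions, not the tuned values or the coil model from the paper:

```python
class ToyCoil:
    """Toy first-order stand-in for the heating coil (illustrative only)."""
    def __init__(self, t_ao=20.0, gain=0.02, leak=0.1):
        self.t_ao, self.gain, self.leak = t_ao, gain, leak

    def update(self, u):
        # Exiting air temperature rises with the control signal
        # and relaxes toward the 20 C inlet condition.
        self.t_ao += self.gain * u - self.leak * (self.t_ao - 20.0)
        return self.t_ao


class PIController:
    """Discrete-time PI controller with the paper's 5-second sampling interval."""
    def __init__(self, kp, ki, dt=5.0, u_min=0.0, u_max=1000.0):
        self.kp, self.ki, self.dt = kp, ki, dt
        self.u_min, self.u_max = u_min, u_max
        self.integral = 0.0

    def step(self, setpoint, measured):
        error = setpoint - measured
        self.integral += error * self.dt
        u = self.kp * error + self.ki * self.integral
        return min(max(u, self.u_min), self.u_max)  # clamp the control signal


def tuning_cost(kp, ki, setpoints):
    """Sum of absolute set-point errors: the criterion used to select the gains."""
    plant, pi, total = ToyCoil(), PIController(kp, ki), 0.0
    for sp in setpoints:
        t_ao = plant.update(pi.step(sp, plant.t_ao))
        total += abs(sp - t_ao)
    return total
```

Gains would be chosen by evaluating `tuning_cost` over a grid of (kp, ki) pairs under a variety of disturbance sequences, as the paper describes.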
To train an n-step-ahead network, a simple inverse model was assumed: the error of the network's output at time step t was defined to be proportional to the difference between the output air temperature and the set point at time t + n. Inputs to the network were T_ai, T_ao, T_wi, T_wo, f_a, f_w, and the set point at time t. The network had these seven inputs, 30 hidden units, and one output unit. Training data were generated by randomly changing the set point and the system disturbances every 30 time steps. Neural networks were trained to minimize the error from two to eight steps ahead. All reduced the error in comparison to the PI controller by itself. For n = 1, 2, and 3, the RMS error between T_ao and its set point was about 0.6 over a 300-step test sequence. For n = 5 and 6, the error increased to approximately 0.7 and 0.65, respectively. Figure 2 shows a 60-step portion of the test sequence. Smaller values of n result in more overshoot of the set point. The performance of the PI controller as shown is significantly damped; recall that the PI proportional and integral gains were chosen to minimize the RMS error over a wide variety of disturbances.
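A rough sketch of this training scheme, assuming a numpy 7-30-1 network and a hypothetical proportionality constant `k` for the inverse model (the paper does not give its value). The delayed temperature error is injected in place of a loss gradient at the output layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# 7 inputs (T_ai, T_ao, T_wi, T_wo, f_a, f_w, set point), 30 hidden units, one
# output, matching the architecture in the text. Inputs are assumed to be
# pre-normalized to modest magnitudes.
W1 = rng.normal(0.0, 0.1, (30, 7)); b1 = np.zeros(30)
W2 = rng.normal(0.0, 0.1, (1, 30)); b2 = np.zeros(1)

def forward(x):
    h = np.tanh(W1 @ x + b1)
    return (W2 @ h + b2)[0], h

def inverse_model_step(x, temp_error_n_ahead, k=1.0, lr=0.01):
    """One training step. The control-signal error is assumed proportional
    (hypothetical gain k) to the air-temperature error n steps ahead, and is
    backpropagated in place of a loss gradient at the output unit."""
    global W1, b1, W2, b2
    u, h = forward(x)
    delta_out = k * temp_error_n_ahead            # simple inverse model
    delta_h = (W2[0] * delta_out) * (1.0 - h**2)  # backprop through tanh layer
    W2 = W2 + lr * delta_out * h[np.newaxis, :];  b2 = b2 + lr * delta_out
    W1 = W1 + lr * np.outer(delta_h, x);          b1 = b1 + lr * delta_h
    return u
```

The sign convention assumes a plant whose output temperature increases with the control signal, so a positive temperature error (air too cold at t + n) pushes the network's output upward.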

Figure 3: Average squared error versus training epochs for networks with 1, 2, 3, and 6 hidden units.

component is missing. When combined with a proportional controller, whose gain is optimized for this case, performance was significantly better than the original PI controller's. Figure 4 shows the behavior of the exiting air temperature when controlled with the PI controller, with the neural network alone, and with the combined neural network and proportional controller. This successful simulation result is not surprising, since well-behaved physical systems are modeled well by statistical methods. However, our approach is more general than an analytical solution: it can be used for systems whose models are not well known and for situations where system constants must be determined by regression.

## 5 Optimal Control with Q-Learning

Reinforcement learning methods embody a general Monte Carlo approach to dynamic programming for solving optimal control problems. Q-learning procedures converge on value functions for state-action pairs that estimate the expected sum of future reinforcements, which reflect behavioral goals that might involve costs, errors, or profits. To define the Q-learning algorithm, we start by representing the system to be controlled as a discrete state space, S, and a finite set of actions, A, that can be taken in all states. A policy is defined by the probability, $\pi(s_t, a)$, that action $a$ is taken in state $s_t$. Let $R(s_t, a_t)$ be the reinforcement resulting from applying action $a_t$ while the system is in state $s_t$, and let $Q^\pi(s_t, a_t)$ be the value function given state $s_t$ and action $a_t$, assuming policy $\pi$ governs action selection from then on.
The value of $Q^\pi(s_t, a_t)$ is

$$Q^\pi(s_t, a_t) = E\left\{ \sum_{k=0}^{T} \gamma^k R(s_{t+k}, a_{t+k}) \right\},$$

where $\gamma$ is a discount factor between 0 and 1 that weights reinforcement received sooner more heavily than reinforcement received later. This expression can be rewritten as an immediate reinforcement plus a discounted sum of future reinforcement:

$$Q^\pi(s_t, a_t) = E\left\{ R(s_t, a_t) + \sum_{k=1}^{T} \gamma^k R(s_{t+k}, a_{t+k}) \right\} = E\left\{ R(s_t, a_t) + \gamma \sum_{k=0}^{T-1} \gamma^k R(s_{t+k+1}, a_{t+k+1}) \right\}.$$

In dynamic programming, policy evaluation is conducted by iteratively updating the value function until it converges on the desired sum. By substituting the estimated value function for the sum in the above equation, the iterative policy-evaluation method from dynamic programming yields the following update to the current estimate of the value function:

$$\Delta Q^\pi(s_t, a_t) = E\left\{ R(s_t, a_t) + \gamma\, Q^\pi(s_{t+1}, a_{t+1}) \right\} - Q^\pi(s_t, a_t),$$
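As a quick numerical sanity check, the finite-horizon discounted sum and its one-step recursive decomposition above agree for any reinforcement sequence; the sequence and discount factor below are arbitrary:

```python
gamma = 0.9
R = [2.0, -1.0, 0.5, 3.0, 1.0]  # arbitrary reinforcements R(s_t, a_t), ..., R(s_{t+T}, a_{t+T})

# Direct form: sum_{k=0}^{T} gamma^k * R_{t+k}
direct = sum(gamma**k * r for k, r in enumerate(R))

# Recursive form: R_t + gamma * (discounted sum starting one step later)
tail = sum(gamma**k * r for k, r in enumerate(R[1:]))
recursive = R[0] + gamma * tail

assert abs(direct - recursive) < 1e-12
```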

Figure 4: Set point and actual temperature versus time as the system is controlled with (a) the PI controller, (b) the neural-network steady-state predictor, and (c) the combined neural-network predictor and proportional controller.

Figure 5: Performance of the PI controller. The top graph shows the set point and the actual output air temperature over time; the bottom graph shows the PI output control signal.

where the expectation is taken over possible next states, $s_{t+1}$, given that the current state is $s_t$ and action $a_t$ was taken. This expectation requires a model of state-transition probabilities. If such a model does not exist, a Monte Carlo approach can be used in which the expectation is replaced by a single sample and the value function is updated by a fraction of the difference:

$$\Delta Q^\pi(s_t, a_t) = \alpha \left[ R(s_t, a_t) + \gamma\, Q^\pi(s_{t+1}, a_{t+1}) - Q^\pi(s_t, a_t) \right],$$

where $0 \le \alpha \le 1$. The term within brackets is often referred to as a temporal-difference error, defined by Sutton [8]. To improve the action-selection policy and achieve optimal control, the dynamic programming method called value iteration can be applied. This method combines steps of policy evaluation with policy improvement. Assuming we want to maximize total reinforcement, as would be the case if reinforcements are positive, such as profits or proximity to a destination, the Monte Carlo version of value iteration for the Q function is

$$\Delta Q(s_t, a_t) = \alpha \left[ R(s_t, a_t) + \gamma \max_{a' \in A} Q(s_{t+1}, a') - Q(s_t, a_t) \right].$$

This is what has become known as the Q-learning algorithm. Watkins [10] proved that it converges to the optimal value function, meaning that selecting the action $a$ that maximizes $Q(s_t, a)$ in any state $s_t$ results in the optimal sum of reinforcement over time.

### 5.1 Experiment

Reinforcement at time step t is determined by the squared error plus a term proportional to the magnitude of the action change from one time step to the next. The squared-error component is calculated by

$$R(t) = \left( T^*_{ao}(t) - T_{ao}(t) \right)^2,$$

where $T^*_{ao}$ is the set point. Inputs to the reinforcement-learning agent consist of the variables T_ai, T_ao, T_wi, T_wo, f_a, f_w, and the set point, all at time t.
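The tabular Q-learning update above can be sketched and checked on a toy problem; the single self-looping state, the action names, and the rewards here are illustrative, not the heating-coil experiment:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One Q-learning step: Q(s,a) += alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    td_error = r + gamma * max(Q[(s_next, a2)] for a2 in actions) - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error

# Toy check: one self-looping state with two actions. The 'good' action pays 1
# every step, so its Q value should converge to 1 / (1 - gamma) = 10, and the
# 'bad' action (paying 0, then behaving greedily) to gamma * 10 = 9.
Q = defaultdict(float)
actions = ('good', 'bad')
for _ in range(3000):
    q_learning_update(Q, 0, 'good', 1.0, 0, actions)
    q_learning_update(Q, 0, 'bad', 0.0, 0, actions)
```

The fixed point matches the discounted-sum definition of Q, which is the convergence property Watkins' proof establishes for the tabular case.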
The output of the network is added directly to the integrated output of the PI controller. The allowed output actions are −100, −50, −20, −10, 0, 10, 20, 50, and 100; the selected action is added to the PI control signal. The Q function is implemented in a quantized, or table-lookup, form. Each of the seven input variables is divided into six intervals, which quantize the 7-dimensional input space into 6^7 hypercubes. Each hypercube stores the Q values for the nine possible actions. The parameter α was decreased during training from an initial value of 0.1. The reinforcement-learning agent was trained repeatedly on 500 time steps of feedback interactions between the PI controller and the simulated heating coil. Figure 5 shows the set point, the actual temperature, and the control signal over these 500 steps; the RMS error between the set point and the actual output air temperature over these 500 steps provides the PI-only baseline.
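The quantized Q function can be sketched as a lookup table indexed by bin numbers. The interval boundaries below are placeholders, since the paper does not specify them:

```python
import numpy as np

N_BINS, N_INPUTS, N_ACTIONS = 6, 7, 9
ACTIONS = [-100, -50, -20, -10, 0, 10, 20, 50, 100]

# Hypothetical (lo, hi) ranges for T_ai, T_ao, T_wi, T_wo, f_a, f_w, set point;
# the paper does not give the interval boundaries, so these are assumptions.
RANGES = [(4, 10), (20, 60), (73, 81), (20, 80), (0.7, 0.9), (0.0, 1.0), (20, 60)]

# One cell per hypercube: 6^7 hypercubes, each holding 9 action values.
q_table = np.zeros((N_BINS,) * N_INPUTS + (N_ACTIONS,))

def hypercube_index(x):
    """Map the 7 input variables to a tuple of bin indices (the hypercube address)."""
    idx = []
    for value, (lo, hi) in zip(x, RANGES):
        b = int((value - lo) / (hi - lo) * N_BINS)
        idx.append(min(max(b, 0), N_BINS - 1))  # clip out-of-range values
    return tuple(idx)

def greedy_action(x):
    """Action (added to the PI signal) with the largest stored Q value."""
    return ACTIONS[int(np.argmax(q_table[hypercube_index(x)]))]
```

Training would apply the Q-learning update to `q_table[hypercube_index(x) + (action_index,)]` after each 5-second step, with occasional exploratory actions.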

Figure 6: Reduction in the RMS error between the set point and T_ao over multiple training epochs, compared with the RMS error of the PI controller alone. The amount of exploration is reduced during training.

### 5.2 Result

The reduction in RMS error while training the reinforcement-learning agent is shown in Figure 6. After approximately 500 epochs of training, the average error between the set point and the actual temperature was reduced to below the level achieved by the PI controller alone. It appears that further error reduction will not occur; the remaining error is probably due to the random walks taken by the disturbances and, if so, cannot be reduced. The resulting behavior of the controlled air temperature is shown in Figure 7. The middle graph shows the output of the reinforcement-learning agent: it has learned to be silent (an output of 0) for most time steps, producing large outputs at set-point changes and at several other time steps. The combined reinforcement-learning-agent output and PI output is shown in the bottom graph.

## 6 Conclusions

Three independent strategies for combining neural-network learning algorithms with PI controllers were successfully shown to result in more accurate tracking of the temperature set point of a simulated heating coil. Variations of each strategy are being studied further, and ways of combining them are being developed; for example, the reinforcement learner can be added to any existing control system, including the combined steady-state predictor and PI controller. Additional investigations continue on the state representation [1, 2], the set of possible reinforcement-learning-agent actions, different neural-network architectures, and strategies for slowly decreasing the amount of exploration performed by the reinforcement-learning agent during learning. Plans include the evaluation of this technique on a physical heating coil controlled with a PC and control hardware.
## Acknowledgements

This work was supported by the National Science Foundation through grant CMS.

Figure 7: Performance of the combined reinforcement-learning agent and PI controller. The top graph shows the set point and the actual output air temperature over time; the second graph shows the output of the reinforcement-learning agent; the third graph shows the combined PI output and reinforcement-learning-agent output.

## References

[1] Charles W. Anderson. Q-learning with hidden-unit restarting. In Stephen Jose Hanson, Jack D. Cowan, and C. Lee Giles, editors, Advances in Neural Information Processing Systems 5. Morgan Kaufmann Publishers, San Mateo, CA, 1993.

[2] Charles W. Anderson and S. G. Crawford-Hines. Multigrid Q-learning. Technical report, Colorado State University, Fort Collins, CO.

[3] Peter S. Curtiss. Experimental results from a network-assisted PID controller. ASHRAE Transactions, 102(1), 1996.

[4] Peter S. Curtiss, Gideon Shavit, and Jan F. Kreider. Neural networks applied to buildings: a tutorial and case studies in prediction and adaptive control. ASHRAE Transactions, 102(1), 1996.

[5] S. J. Hepworth and A. L. Dexter. Neural control of a non-linear HVAC plant. In Proceedings of the 3rd IEEE Conference on Control Applications.

[6] S. J. Hepworth, A. L. Dexter, and S. T. P. Willis. Control of a non-linear heater battery using a neural network. Technical report, University of Oxford.

[7] Minoru Kawashima, Charles E. Dorgan, and John W. Mitchell. Optimizing system control with load prediction by neural networks for an ice-storage system. ASHRAE Transactions, 102(1), 1996.

[8] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, 1988.

[9] David M. Underwood and Roy R. Crawford. Dynamic nonlinear modeling of a hot-water-to-air heat exchanger for control applications. ASHRAE Transactions, 97(1), 1991.

[10] C. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University Psychology Department, 1989.


### Machine Learning. Machine Learning: Jordan Boyd-Graber University of Maryland REINFORCEMENT LEARNING. Slides adapted from Tom Mitchell and Peter Abeel

Machine Learning Machine Learning: Jordan Boyd-Graber University of Maryland REINFORCEMENT LEARNING Slides adapted from Tom Mitchell and Peter Abeel Machine Learning: Jordan Boyd-Graber UMD Machine Learning

### Confidence Estimation Methods for Neural Networks: A Practical Comparison

, 6-8 000, Confidence Estimation Methods for : A Practical Comparison G. Papadopoulos, P.J. Edwards, A.F. Murray Department of Electronics and Electrical Engineering, University of Edinburgh Abstract.

### Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms *

Proceedings of the 8 th International Conference on Applied Informatics Eger, Hungary, January 27 30, 2010. Vol. 1. pp. 87 94. Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms

### 4. Multilayer Perceptrons

4. Multilayer Perceptrons This is a supervised error-correction learning algorithm. 1 4.1 Introduction A multilayer feedforward network consists of an input layer, one or more hidden layers, and an output

### Short Term Load Forecasting Using Multi Layer Perceptron

International OPEN ACCESS Journal Of Modern Engineering Research (IJMER) Short Term Load Forecasting Using Multi Layer Perceptron S.Hema Chandra 1, B.Tejaswini 2, B.suneetha 3, N.chandi Priya 4, P.Prathima

### Learning Tetris. 1 Tetris. February 3, 2009

Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are

### Temporal Difference Learning & Policy Iteration

Temporal Difference Learning & Policy Iteration Advanced Topics in Reinforcement Learning Seminar WS 15/16 ±0 ±0 +1 by Tobias Joppen 03.11.2015 Fachbereich Informatik Knowledge Engineering Group Prof.

### RL 3: Reinforcement Learning

RL 3: Reinforcement Learning Q-Learning Michael Herrmann University of Edinburgh, School of Informatics 20/01/2015 Last time: Multi-Armed Bandits (10 Points to remember) MAB applications do exist (e.g.

### Reinforcement Learning

Reinforcement Learning Function approximation Mario Martin CS-UPC May 18, 2018 Mario Martin (CS-UPC) Reinforcement Learning May 18, 2018 / 65 Recap Algorithms: MonteCarlo methods for Policy Evaluation

### A STATE-SPACE NEURAL NETWORK FOR MODELING DYNAMICAL NONLINEAR SYSTEMS

A STATE-SPACE NEURAL NETWORK FOR MODELING DYNAMICAL NONLINEAR SYSTEMS Karima Amoura Patrice Wira and Said Djennoune Laboratoire CCSP Université Mouloud Mammeri Tizi Ouzou Algeria Laboratoire MIPS Université

### Ensembles of Extreme Learning Machine Networks for Value Prediction

Ensembles of Extreme Learning Machine Networks for Value Prediction Pablo Escandell-Montero, José M. Martínez-Martínez, Emilio Soria-Olivas, Joan Vila-Francés, José D. Martín-Guerrero IDAL, Intelligent

POMDPs and Policy Gradients MLSS 2006, Canberra Douglas Aberdeen Canberra Node, RSISE Building Australian National University 15th February 2006 Outline 1 Introduction What is Reinforcement Learning? Types

### ECE521 Lectures 9 Fully Connected Neural Networks

ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance

### An Introduction to Reinforcement Learning

An Introduction to Reinforcement Learning Shivaram Kalyanakrishnan shivaram@cse.iitb.ac.in Department of Computer Science and Engineering Indian Institute of Technology Bombay April 2018 What is Reinforcement

### ECLT 5810 Classification Neural Networks. Reference: Data Mining: Concepts and Techniques By J. Hand, M. Kamber, and J. Pei, Morgan Kaufmann

ECLT 5810 Classification Neural Networks Reference: Data Mining: Concepts and Techniques By J. Hand, M. Kamber, and J. Pei, Morgan Kaufmann Neural Networks A neural network is a set of connected input/output

### On and Off-Policy Relational Reinforcement Learning

On and Off-Policy Relational Reinforcement Learning Christophe Rodrigues, Pierre Gérard, and Céline Rouveirol LIPN, UMR CNRS 73, Institut Galilée - Université Paris-Nord first.last@lipn.univ-paris13.fr

### Bidirectional Representation and Backpropagation Learning

Int'l Conf on Advances in Big Data Analytics ABDA'6 3 Bidirectional Representation and Bacpropagation Learning Olaoluwa Adigun and Bart Koso Department of Electrical Engineering Signal and Image Processing

### A Robust PCA by LMSER Learning with Iterative Error. Bai-ling Zhang Irwin King Lei Xu.

A Robust PCA by LMSER Learning with Iterative Error Reinforcement y Bai-ling Zhang Irwin King Lei Xu blzhang@cs.cuhk.hk king@cs.cuhk.hk lxu@cs.cuhk.hk Department of Computer Science The Chinese University

### Reinforcement Learning II

Reinforcement Learning II Andrea Bonarini Artificial Intelligence and Robotics Lab Department of Electronics and Information Politecnico di Milano E-mail: bonarini@elet.polimi.it URL:http://www.dei.polimi.it/people/bonarini

### The Markov Decision Process Extraction Network

The Markov Decision Process Extraction Network Siegmund Duell 1,2, Alexander Hans 1,3, and Steffen Udluft 1 1- Siemens AG, Corporate Research and Technologies, Learning Systems, Otto-Hahn-Ring 6, D-81739

### Reinforcement Learning. Machine Learning, Fall 2010

Reinforcement Learning Machine Learning, Fall 2010 1 Administrativia This week: finish RL, most likely start graphical models LA2: due on Thursday LA3: comes out on Thursday TA Office hours: Today 1:30-2:30

### Procedia Computer Science 00 (2011) 000 6

Procedia Computer Science (211) 6 Procedia Computer Science Complex Adaptive Systems, Volume 1 Cihan H. Dagli, Editor in Chief Conference Organized by Missouri University of Science and Technology 211-

### Lecture 23: Reinforcement Learning

Lecture 23: Reinforcement Learning MDPs revisited Model-based learning Monte Carlo value function estimation Temporal-difference (TD) learning Exploration November 23, 2006 1 COMP-424 Lecture 23 Recall:

### Policy Gradient Reinforcement Learning for Robotics

Policy Gradient Reinforcement Learning for Robotics Michael C. Koval mkoval@cs.rutgers.edu Michael L. Littman mlittman@cs.rutgers.edu May 9, 211 1 Introduction Learning in an environment with a continuous