Artificial Intelligence for HVAC Systems

Artificial Intelligence for HVAC Systems V. Ziebart, ACSS IAMP ZHAW

Motivation
More than 80% of the energy consumed in Swiss households is used for room heating & cooling and domestic hot water.* The increasing combined use of various energy conversion and storage technologies (PV, solar thermal collectors, heat pumps, combustion, batteries, hot water storage, ice storage systems) requires intelligent and optimal control systems. Reinforcement Learning (RL) has shown promising performance in different fields (AlphaZero, computer games) and is a data-driven approach. Can RL be used as a control method for HVAC** systems (optimality, learning behavior, robustness, ...)?
* Energieverbrauch in der Schweiz und weltweit, EnergieSchweiz, Bundesamt für Energie BFE, Dienst Aus- und Weiterbildung, Juli 2015
** HVAC: heating, ventilation, air conditioning

Model of Building and Heating System
[Schematic of the simulated building and heating system: weather data of 11 Swiss cities (T_amb, T_forecast), occupation pattern, room temperatures (T_air, T_ch, T_floor), cellar, heat pump with COP, two water storages (T_s1, T_s2), controllers PID 1 and PID 2, soil temperature T_soil, energy price with night rate 0.5 and day rate 1.]

Reinforcement Learning
Control policy $\pi$, state-action value function $q(s,a)$:
value function: $q_\pi(s,a) = \mathbb{E}_\pi\big[\,G_t \mid S_t = s,\ A_t = a\,\big]$
return: $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
discount rate: $\gamma \in [0,1]$
Schematic from: Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, 2018
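
As a small illustration of the return definition above (not part of the original slides), a short Python sketch that accumulates $G_t$ backwards over a finite reward sequence:

```python
# Small illustration of the return definition above (not from the slides):
# accumulate G_t backwards over a finite reward sequence R_1, R_2, ...
def discounted_return(rewards, gamma=0.95):
    g = 0.0
    for r in reversed(rewards):    # G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([-0.5, -0.5, -1.0], gamma=0.95))   # -> -1.8775
```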

Q-Learning and SARSA
Q-learning: $q(s_t,a_t) \leftarrow q(s_t,a_t) + \alpha\,\big(R_{t+1} + \gamma \max_a q(s_{t+1},a) - q(s_t,a_t)\big)$
SARSA: $q(s_t,a_t) \leftarrow q(s_t,a_t) + \alpha\,\big(R_{t+1} + \gamma\, q(s_{t+1},a_{t+1}) - q(s_t,a_t)\big)$
The bracketed term compares the new estimate of the return $G_t$ (the bootstrapped target) with the current estimate $q(s_t,a_t)$, scaled by the learning rate $\alpha$: a difference of «large» numbers!
These interacting, interdependent, iterative jobs must be done:
1. learn q for the actions performed in the states visited
2. approximate a generalized q-function on the state and action space (S x A) (typically with gradient descent)
3. improve control performance, e.g. by an (ε-)greedy policy
→ potentially unstable
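
The two update rules can be written as a short tabular sketch; the dictionary-based q-table, the action set and the constants below are illustrative assumptions, not the presented HVAC controller:

```python
from collections import defaultdict

# Tabular sketch of the two update rules above; q-table, action set and constants
# are illustrative assumptions, not the presented HVAC controller.
q = defaultdict(float)          # q[(state, action)] -> value estimate
actions = [0, 1, 2]             # e.g. heat pump off / load storage 1 / load storage 2
alpha, gamma = 0.1, 0.95

def q_learning_update(s, a, r, s_next):
    # off-policy target: bootstrap with the greedy action in s_next
    target = r + gamma * max(q[(s_next, b)] for b in actions)
    q[(s, a)] += alpha * (target - q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next):
    # on-policy target: bootstrap with the action actually taken in s_next
    target = r + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (target - q[(s, a)])
```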

Improvement of Stability
Ad 1: more efficient learning of q through n-step SARSA,
$G_{t,n} = \sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n q(s_{t+n}, a_{t+n})$
a truncated sum of n rewards instead of $G_{t,1} = R_{t+1} + \gamma\, q(s_{t+1}, a_{t+1})$.
Average fraction of the full return covered by the truncated sum:
$\dfrac{\sum_{k=0}^{n-1} \gamma^k R}{\sum_{k=0}^{\infty} \gamma^k R} = \dfrac{(1-\gamma^n)/(1-\gamma)}{1/(1-\gamma)} = 1 - \gamma^n$
Ad 2: least-squares fit to $\{s_t, a_t, G_{t,n}\}$ with polynomials and trigonometric functions to find q(s,a) on S x A, instead of iterative gradient descent methods.
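
A hedged sketch of the n-step target $G_{t,n}$; the q-function handle and the constants are placeholders (the original implementation was in Mathematica and is not reproduced here):

```python
# Hedged sketch of the n-step SARSA target G_{t,n}; q-function handle and constants
# are placeholders, not the original implementation.
def n_step_target(rewards, s_n, a_n, q, gamma=0.95, n=24):
    """rewards: R_{t+1} ... R_{t+n}; (s_n, a_n): state/action reached after n steps."""
    assert len(rewards) == n
    g = sum(gamma**k * rewards[k] for k in range(n))  # sum_{k=0}^{n-1} gamma^k R_{t+k+1}
    return g + gamma**n * q(s_n, a_n)                 # + gamma^n q(s_{t+n}, a_{t+n})

# Fraction of an infinite constant-reward return covered by the truncated sum: 1 - gamma^n
print(1 - 0.95**24)   # ~0.71 for gamma = 0.95, n = 24
```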

State, Action, Reward
State variables for RLC (continuous and discrete):
6 temperatures, $\mathbb{R}^6$ (T_amb, T_air, T_floor, T_storage1, T_storage2, T_forecast)
time, real interval [0, 24[
room occupancy, boolean
Action variables (discrete, 12 combinations):
heat pump off / loading storage 1 or 2, {0, 1, 2}
PID floor heating on/off, boolean
PID convection heater on/off, boolean
Reward (≤ 0):
energy costs (negative, night and day rate)
temperature deviation from setpoint (negative, proportional to ΔT, only if the house is occupied)
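
For illustration only, hypothetical containers mirroring the state and action variables listed above (names are assumptions, not the original code):

```python
from dataclasses import dataclass

# Hypothetical containers mirroring the listed state/action variables
# (names are illustrative, not the original Mathematica code).
@dataclass
class State:
    t_amb: float        # outside temperature
    t_air: float        # room air temperature
    t_floor: float
    t_storage1: float
    t_storage2: float
    t_forecast: float
    time_of_day: float  # real-valued, in [0, 24)
    occupied: bool      # room occupancy

@dataclass
class Action:
    heat_pump: int           # 0 = off, 1 = load storage 1, 2 = load storage 2
    pid_floor_on: bool
    pid_convector_on: bool   # 3 * 2 * 2 = 12 discrete combinations
```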

Approximation of State-Action Value Function q(s,a)
Partitioning of the state-action space into continuous (c) and discrete (d) subspaces:
S x A = (S_c x S_d) x (A_c x A_d) = (S_c x A_c) x (S_d x A_d)
For each set of discrete variables in S_d x A_d (24 configurations), q(s,a) is approximated in the continuous subspace S_c x A_c by a linear combination of polynomials (temperatures) and trigonometric functions (time):
$q_k(s,a) = \sum_{j,l} c_{kjl} \prod_{i=1}^{6} T_i^{\,n(k,i,j)}\, \mathrm{trig}_l(t), \qquad k = 1,\ldots,24$
rewritten as a flattened vector product
$q_k(s,a) = \sum_{j} c_{kj}\, p_j(T_1,\ldots,T_6,t) = c_k \cdot p, \qquad k = 1,\ldots,24$
typically: j = 252 features per configuration, i.e. k·j = 6 048 coefficients $c_{kj}$
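
A minimal sketch of one possible feature vector p(T_1, ..., T_6, t) built from polynomial and trigonometric terms; the exact exponents and frequencies of the presented model are assumptions here:

```python
import numpy as np

# Minimal sketch of one possible feature vector p(T_1..T_6, t): products of low-order
# polynomial terms in the temperatures and daily trigonometric terms in the time of day.
# The exact exponents/frequencies of the presented model are not reproduced here.
def features(temps, t, poly_degree=2, n_freq=2):
    poly_terms = [1.0]
    for T in temps:                                  # per-temperature monomials T^1 .. T^d
        poly_terms += [T**d for d in range(1, poly_degree + 1)]
    trig_terms = [1.0]
    for m in range(1, n_freq + 1):                   # harmonics of the 24 h day
        trig_terms += [np.sin(2 * np.pi * m * t / 24), np.cos(2 * np.pi * m * t / 24)]
    # flattened outer product -> one coefficient per (polynomial, trig) pair
    return np.outer(poly_terms, trig_terms).ravel()

# q_k(s, a) = c_k . p, one coefficient vector c_k per discrete configuration k
p = features(temps=[5.0, 21.0, 23.0, 40.0, 45.0, 4.0], t=13.5)
print(p.shape)   # (65,) for this toy choice; the slide uses ~252 features per configuration
```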

Analytical Solution for c_k
Older experience is discounted by $\gamma_Q$. For each discrete configuration k the coefficients minimize the discounted squared errors over all visits i:
$c_k = \arg\min_{c_k} \sum_{i} \gamma_Q^{\,N_k - i} \Big( \sum_{j=0}^{n-1} \gamma^j R_{s(i,k)+j+1} + \gamma^n q\big(s_{s(i,k)+n}, a_{s(i,k)+n}\big) - c_k \cdot p_{s(i,k)} \Big)^2, \qquad k = 1,\ldots,24$
where s(i,k) is the time index at which $(s_d, a_d) = (s_d, a_d)_k$ for the i-th time and $N_k$ is the number of such visits so far. Rearranging yields the recursive updates
$Q_k^{(i)} = \gamma_Q\, Q_k^{(i-1)} + p_{s(i,k)}\, p_{s(i,k)}^{\top}$
$b_k^{(i)} = \gamma_Q\, b_k^{(i-1)} + 2\, p_{s(i,k)} \Big( \sum_{j=0}^{n-1} \gamma^j R_{s(i,k)+j+1} + \gamma^n q\big(s_{s(i,k)+n}, a_{s(i,k)+n}\big) \Big)$
$c_k = 0.5\, (Q_k + \varepsilon I)^{-1}\, b_k$ (the term $\varepsilon I$ is added for numerical reasons)
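
The recursive accumulation and the regularized solve can be sketched as follows; the class API is an assumption, while the variable names (Q_k, b_k, gamma_Q, eps) and the factor 2 / 0.5 pairing follow the equations above:

```python
import numpy as np

# Illustrative sketch of the recursive accumulation and regularized solve above.
# Variable names follow the slide; the class API itself is an assumption.
class DiscountedLeastSquaresQ:
    def __init__(self, n_features, gamma_q=0.99, eps=1e-6):
        self.Q = np.zeros((n_features, n_features))   # discounted sum of p p^T
        self.b = np.zeros(n_features)                  # discounted sum of 2 * G * p
        self.gamma_q, self.eps = gamma_q, eps

    def add_sample(self, p, g):
        """p: feature vector of the visited (s, a); g: n-step return target G_{t,n}."""
        self.Q = self.gamma_q * self.Q + np.outer(p, p)   # older experience decays by gamma_Q
        self.b = self.gamma_q * self.b + 2.0 * g * p

    def coefficients(self):
        """c_k = 0.5 * (Q_k + eps*I)^-1 b_k; eps*I is added for numerical reasons."""
        n = self.Q.shape[0]
        return 0.5 * np.linalg.solve(self.Q + self.eps * np.eye(n), self.b)
```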

Decision Making, Reward Normalization
Probability of choosing $a_i$ (softmax policy with temperature $\tau$):
$\pi(a_i \mid s) = \dfrac{e^{\,q(s,a_i)/\tau}}{\sum_{j=1}^{n} e^{\,q(s,a_j)/\tau}}$
$\tau \to \infty:\ \pi(a_i \mid s) = \pi(a_j \mid s)$ (uniform); $\tau \to 0:\ \pi(\arg\max_i q(s,a_i) \mid s) = 1$ (greedy). Note that $q(s,a) \le 0$!
Normalization of the reward per year (plots only!) by the difference between the room set point temperature and the average outside temperature:
$R_{year,norm} = \dfrac{R_{year}}{\Delta T + 0.005\, \Delta T^2}$ with $\Delta T = T_{sp} - \bar{T}_{amb}$
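
A small sketch of the softmax action selection above; shifting by the maximum is a standard numerical-stability trick and not from the slide:

```python
import numpy as np

# Small sketch of the softmax (Boltzmann) action selection above; shifting by the
# maximum is a standard stability trick and does not change the probabilities.
def softmax_policy(q_values, tau):
    """q_values: q(s, a_i) for all actions (all <= 0 here); tau: temperature."""
    z = np.array(q_values) / tau
    z -= z.max()
    probs = np.exp(z)
    return probs / probs.sum()

q_values = [-3.2, -2.8, -4.0]
print(softmax_policy(q_values, tau=1.0))    # moderately greedy
print(softmax_policy(q_values, tau=100.0))  # tau -> inf: nearly uniform
print(softmax_policy(q_values, tau=0.01))   # tau -> 0: essentially argmax
```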

Hardware & Software
2 servers with 2 CPUs (Intel Xeon Platinum 8164) each
Each CPU with 26 cores
768 GB RAM
Code written in Mathematica
Computation time for a 1-year simulation: 1 min

SARSA vs Q-Learning

Effect of n in n-step SARSA

Effect of n in n-step SARSA (n = 1 is out of plot range)

Influence of Discount Factor γ
Actual time 5:00. One more heating hour before 20:00 is needed. Which hour seems to be the best choice? (If γ > 0.952: heating at 5:00; otherwise heating at 19:00.)
Too low γ's in particular lead to suboptimal decisions and unrealistic q-values.
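
One plausible reading of the 0.952 threshold, assuming the night rate 0.5 and day rate 1 from the system model, hourly discounting, and 14 hours between the two candidate heating hours:

```latex
% Assumed reading: night-rate cost 0.5 at 5:00 vs day-rate cost 1.0 at 19:00, 14 hours later.
% Heating at 5:00 is preferred iff its discounted cost is lower, 0.5 < \gamma^{14} \cdot 1.0, i.e.
\[
  \gamma > 0.5^{1/14} \approx 0.952 .
\]
```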

n-step SARSA without Discount
The discount of future rewards disturbs the optimal scheduling of actions with fixed and known costs. Alternative approach: n-step SARSA without discount (γ = 1),
$G_{t,n} = \sum_{k=0}^{n-1} R_{t+k+1} + q(s_{t+n}, a_{t+n})$
The time horizon is defined by n instead of γ!

n-step SARSA without Discount

Performance of n-step SARSA without Discount (plot indicates the optimal range of n)

Parallel «Supporting» Agents, Motivation
Could the large scatter between agents be exploited by helping worse agents become better, so that by chance they even end up better than the best agent?

Parallel «Supporting» Agents, Structure
[Schematic: n agents (each with hyperparameters p_hyper,i and support rate ε_supp,i) are trained in parallel on the same occupation & weather data; each reports its yearly reward R_i and its q-coefficients q_coeff,i, and the coefficients of the best agent, q_coeff,best_agent, are fed back to all agents.]
$\varepsilon_{supp,i} = \begin{cases} \min\!\big(c\, e^{(R_i - R_{best\_agent})/\tau_{supp}},\ 1\big) & \text{if \#year is even} \\ 0 & \text{if \#year is odd} \end{cases}$
(the unsupported odd years are needed to evaluate each agent's own performance)
For every decision a random number ε is drawn:
ε > ε_supp: decision based on the agent's own parameters
ε < ε_supp: decision based on the best agent's parameters
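
A hedged sketch of the per-decision support rule described above; the control flow, default constants and the sign convention in the exponent are assumptions reconstructed from the slide:

```python
import math
import random

# Hedged sketch of the support rule above; constants and the sign convention in the
# exponent are assumptions reconstructed from the slide, not the original code.
def support_rate(r_agent, r_best, year, c=0.1, tau_supp=1.0):
    """eps_supp,i: capped exponential of the reward gap in even years, 0 in odd years."""
    if year % 2 == 1:
        return 0.0        # unsupported odd years are needed to evaluate the agent itself
    return min(c * math.exp((r_agent - r_best) / tau_supp), 1.0)

def choose_coefficients(q_own, q_best, eps_supp):
    """With probability eps_supp, decide with the best agent's q-coefficients."""
    return q_best if random.random() < eps_supp else q_own
```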

Parallel «Supporting» Agents, Results
All agents show better performance than the best agent without support.

Parallel «Supporting» Agents, Results

RL Control Example (1)

RL Control Example (2)

RL Control Example (3)
The agent heats at the cheaper night rate only (plot shows −0.1 × energy rate).

Simulated Reinforcement Learning Schematic: Peter Bolt, ACSS IAMP ZHAW

RBC with Parameters Optimized by RL
Some simple fixed rules; the setpoints for water storage heating 1 & 2 are parametrized, and k_1 and k_2 are learned:
$T_{sp,heating\,st1} = T_{sp,room} + \Big(T_{sp,room} - \dfrac{T_{amb} + T_{forecast}}{2}\Big)\, k_1$
$T_{sp,heating\,st2} = T_{sp,room} + \Big(T_{sp,room} - \dfrac{T_{amb} + T_{forecast}}{2}\Big)\, k_2$
The reward (energy consumption & set point violation) is normalized and thus almost independent of the outside temperature:
$R_{norm} = \dfrac{R}{\Delta T + 0.005\, \Delta T^2}$ with $\Delta T = T_{sp} - \bar{T}_{amb}$
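
A short sketch of the parametrized rules and the reward normalization; only k1 and k2 are learned, and the k1/k2 values in the usage line are arbitrary illustrations:

```python
# Short sketch of the parametrized rule-based setpoints and the normalized reward
# from the slide; the k1/k2 values used below are arbitrary illustrations.
def storage_setpoints(t_amb, t_forecast, t_sp_room, k1, k2):
    t_mean = (t_amb + t_forecast) / 2.0
    t_sp_st1 = t_sp_room + (t_sp_room - t_mean) * k1   # setpoint water storage 1
    t_sp_st2 = t_sp_room + (t_sp_room - t_mean) * k2   # setpoint water storage 2
    return t_sp_st1, t_sp_st2

def normalized_reward(r, t_sp, t_amb_mean):
    """Normalize by the gap between room setpoint and average outside temperature."""
    dt = t_sp - t_amb_mean
    return r / (dt + 0.005 * dt**2)

print(storage_setpoints(t_amb=2.0, t_forecast=4.0, t_sp_room=21.0, k1=1.2, k2=1.8))
# -> (42.6, 53.4)
```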

RL with Parametrized RBC: acceptable performance in less than 1 year!

Summary
RLC for a heating system
converges to nearly optimal trajectories
needs on the order of 100 simulated years for convergence (→ simulated RL)
Improvements to RL
least-squares fit to get q(s,a) in one step
n-step SARSA
truncated reward sum without discount
parallel, mutually supporting agents
Alternative approach with parametrized RBC
much shorter learning time due to fewer coefficients
good performance