Artificial Intelligence for HVAC Systems

Size: px

Start display at page:

Download "Artificial Intelligence for HVAC Systems"

Angela York
5 years ago
Views:

1 Artificial Intelligence for HVAC Systems V. Ziebart, ACSS IAMP ZHAW

2 Motivation More than 80% of the energy consumed in Swiss households is used for room heating & cooling and domestic hot water.* The increasing combined use of various energy conversion and storage technologies (PT, solar thermal collectors, heat pumps, combustion, batteries, hot water storage, ice storage systems) requires intelligent and optimal control systems. Reinforcement Learning (RL) showed promising performance in different fields (AlphaZero, computer games). RL is a data-driven approach. Can RL be used as control method for HVAC**-Systems? (optimality, learning behavior, robustness, * Energieverbrauch in der Schweiz und weltweit, EnergieSchweiz, Bundesamt für Energie BFE Dienst Aus-und Weiterbildung, Juli 2015 ** HVAC: heating, ventilation, air conditioning

3 Model of Building and Heating System Weather data of 11 Swiss cities T amb T forecast Occupation pattern T air T ch T floor COP of heat pump PID 1 T PID 2 cellar T s1 heat pump T s2 Enery price T soil night rate 0.5 date rate 1

4 Reinforcement Learning Control policy π qsa (, ) = value function : qπ(, s a) = Επ GtSt = s, At = a k return : Gt = γ R k = 0 discount rate : γ [0,1] t+ k+ 1 Schematic from: Richard S. Sutton and Andrew G. Barto, An introduction to reinforcement learning, 2018

5 Q-Learning and SARSA Q-learning: ( R max ( ) t + γ q t 1 a) qst, qs (, a) qs (, a) + α s +, ( a ) SARSA: new estimate of return G t t t t t a learning rate ( R+ γ qs (, a ) ) qs (, a) qs (, a) + α + + qs (, a) t t t t t t 1 t 1 t t difference of «large» numbers! These interacting, interdependent, interative jobs must be done: 1. learn q for actions performed in state visited 2. approximate a generalized q-function on state and action space (SxA) (typically with gradient descent) 3. improve control performance by e.g. (ε-)greedy policy potentially unstable

6 Improvement of Stability Ad 1: More efficient learning of q through n-step SARSA n 1 k n γ k 0 t+ k+ γ = t+ n t+ n Gtn, = R qs (, a ) truncated sum of n rewards instead of Gt,1 = Rt + γ qs ( t+ 1, at+ 1). Average fraction return: n 1 k = 0 k = 0 k γ R k γ R n 1 γ 1 γ = = 1 γ 1 1 γ n Ad 2: Least-Squares fit to {s t,a t,g t,n } with polynomials and trigonometric functions to find q(s,a) on SxA instead of iterative gradient descent methods.

7 State, Action, Reward State variables for RLC (continous and discrete) 6 temperatures, R 6 (T amb, T air, T floor, T storage1, T storage2, T forecast ) time, real interval [0,24[ room occupancy, boolean Action variables (discrete, 12 combinations) heat pump off/loading storage 1 or 2, {0,1,2} PID floor heating on/off, boolean PID convection heater on/off, boolean Reward ( 0): energy costs (negative, night and day rate) temperature deviation from setpoint, (negative, proportional to T, only if house is occupied)

8 Approximation of State-Action Value Function q(s,a) Partitioning of state-action space in continous (c) and discrete (d) subspaces: S x A = (S c x S d ) x (A c x A d ) = S c x A c x S d x A d continous discrete For each set of discrete variables in S d x A d (24 configurations) q(s,a) is approximated in the continous subspace S c x A c by a linear combination of polynomials (temperatures) and trigonometric functions (time): 6 nki (,, j) k kjl i l jl, i= 1 q ( s, a) = c T trig ( t) k = 1,...,24 rewritten as flattened vector product q( sa, ) = c p( T,..., T, t) = c p k= 1,...,24 k kj j 1 6 k j typically: j = 252 k j = coefficients c kj

9 Analytical Solution c older experience is discounted by γ Q = arg min R + q s, a, k = 1,...,24 t n 1 t i j n k γq γ sik (, ) + j γ ( sik (, ) + n sik (, ) + n) ckp sik (, ) i= 1 j= 0 2 s(i,k) is the index when (s d,a d ) i =(s d,a d ) k the i-th time rearranging yields: { (, ), (, ), } ( i) ( i 1) k = γ Q k + sik r sik l Q Q p p n 1 ( i) ( i 1) j n k = γq k 2 sik (, ) γ Rsik (, ) + j+ γ q ssik (, ) n, a + sik (, ) + n j= 0 b b p ( ε ) 1 c = 0.5 Q + I b k k k k rl for nummerical reasons ( )

10 Decision Making, Reward Normalization probability of choosing a i : π ( ai s) = n e j= 1 (, ) qsa e i (, j ) qsa τ τ ( ai s) ( aj s) ( qsai s) τ : π = π τ 0 : π arg max (, ) = 1 qsa (, ) 0! i normalization of reward per year (plots only!) by difference of average outside temperature and room set point temperature R R with T T T T T year year, norm = = 2 sp amb

11 Hardware & Software 2 Servers with 2 CPUs Intel Xeon Platinum 8164 each Each CPU with 26 cores 768 GB RAM Code written in Mathematica Computation time for 1 year simulation: 1min

12 SARSA vs Q-Learning

13 Effect of n in n-step SARSA

14 Effect of n in n-step SARSA n=1 out of plot range

15 Influence of Discount Factor γ Actual time 5:00. One more heating hour before 20:00 is needed. What hour seems to be the best choice? (If γ>0.952: heating at 5:00, otherwise heating at 19:00) especially too low γ s lead to suboptimal decisions and unrealistic q-values

16 n-step SARSA without Discount The discount of future rewards disturbs the optimal scheduling of actions with fixed and known costs. Alternative approach: n-step SARSA without discount n 1 k n γ k 0 t+ k+ γ = t+ n t+ n Gtn, = R qs (, a ) The time horizon is defined by n instead of γ!

17 n-step SARSA without Discount

18 Performance of n-step SARSA without Discount optimal range

19 Parallel «Supporting» Agents, Motivation Could larger scatter be exploited by helping worse agents become better and then by chance even be better than best agent?

20 Parallel «Supporting» Agents, Structure occupation & weather data R Agent 1 (p hyper,1, ε supp,1 ) Agent 2 (p hyper,2, ε supp,2 ) Agent 3 (p hyper,3 ε supp,3 ) Agent n (p hyper,n ε supp,n ), q best _ agent coeff, best _ agent R1, qcoeff,1 R R 2, qcoeff,2 3, qcoeff,3 control & performance parameter ε R ( ) ( # / min 1 ) i year τ supp c e,1 if # year even R R 0 if # year odd supp supp, i = i best _ agent ε > ε ε < ε supp supp : decision based on own parameters : decision based on best agents parameters needed to evalute performance

21 Parallel «Supporting» Agents, Results all agents show better performance than best agent without support

22 Parallel «Supporting» Agents, Results

23 RL Control Example (1)

24 RL Control Example (2)

25 RL Control Example (3) -0.1 *(energy rate) agent heats at cheaper night rate only

26 Simulated Reinforcement Learning Schematic: Peter Bolt, ACSS IMAP ZHAW

27 RBC with Parameters Optimized by RL Some simple fixed rules Setpoints for water storage heating 1 & 2 are parametrized, k 1 and k 2 learned Tamb + Tforecast Tsp, heating st1 = Tsp, room + Tsp, room k1 2 T + T T T T k 2 amb forecast sp, heating st 2 = sp, room + sp, room 2 Reward (energy consumption & set point violation) is normalized and thus almost independent of outside temperature: R = R with T = T T T T norm 2 sp amb

28 RL with Parametrized RBC acceptable performance in less than 1 year!

29 Summary RLC for a heating system converges to nearly optimal trajectories needs order of 100 simulated years for convergence ( simulated RL) Improvements to RL least-squares fit to get q(s,a) in one step n-step SARSA truncated reward sum without discount parallel, mutually supporting agents Alternative approach with parametrized RBC much shorter learning time due to less coefficients good performance

Reinforcement Learning. Machine Learning, Fall 2010

Reinforcement Learning. Machine Learning, Fall 2010 Reinforcement Learning Machine Learning, Fall 2010 1 Administrativia This week: finish RL, most likely start graphical models LA2: due on Thursday LA3: comes out on Thursday TA Office hours: Today 1:30-2:30