An Online Learning Algorithm for Demand Response in Smart Grid


Shahab Bahrami, Student Member, IEEE, Vincent W.S. Wong, Fellow, IEEE, and Jianwei Huang, Fellow, IEEE

Abstract: A demand response program with real-time pricing can encourage electricity users to schedule their energy usage in off-peak hours. A user needs to schedule the energy usage of his appliances in an online manner, since he may not know the energy prices and the demand of his appliances ahead of time. In this paper, we study the users' long-term load scheduling problem and model the changes of the price information and load demand as a Markov decision process, which enables us to capture the interactions among users as a partially observable stochastic game. To make the problem tractable, we approximate the users' optimal scheduling policy by the Markov perfect equilibrium (MPE) of a fully observable stochastic game with incomplete information. We develop an online load scheduling learning (LSL) algorithm based on the actor-critic method to determine the users' MPE policy. When compared with the benchmark of not performing demand response, simulation results show that the LSL algorithm can reduce the expected cost of users and the peak-to-average ratio (PAR) in the aggregate load by 28% and 13%, respectively. When compared with the short-term scheduling policies, the users with the long-term policies can reduce their expected cost by 17%.

Keywords: Demand response, real-time pricing, partially observable stochastic game, online learning, actor-critic method.

I. INTRODUCTION

The future smart grid aims to empower utility companies and users to make more informed energy management decisions. This motivates the utility companies to provide users with incentives to adjust the timing of their electricity usage [1]. The incentives may be through a demand response program with time-varying pricing schemes such as real-time pricing (RTP) and inclining block rate (IBR) pricing [2].
With a properly designed demand response program, the utility company can decrease its generation cost due to the reduction of peak-to-average ratio (PAR) in the aggregate load. Meanwhile, users can reduce their payment by taking advantage of low prices at off-peak hours.

There are several challenges for users to optimally determine their energy schedule in a demand response program. First, if the utility company uses RTP or IBR, the users' scheduling decisions are coupled, since the appliances' energy schedule of a user affects the price that is charged to all users, and hence affects other users' cost. Second, each user is uncertain about the total demand of other users, as well as the time of use and operation constraints of his own appliances. In particular, each appliance's operation depends on its task specifications (e.g., task duration, start time/deadline of the task), which are not known a priori until the user decides to turn on that appliance. Third, the users may not know the price information ahead of time.

Manuscript received on Oct. 8, 2016, revised on Jan. 11, 2017, and accepted on Feb. 2. This work is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) under Strategic Project Grant (STPGP ), and the Theme-based Research Scheme (Project No. T23-407/13-N) from the Research Grants Council of the Hong Kong Special Administrative Region, China. S. Bahrami and V.W.S. Wong are with the Department of Electrical and Computer Engineering, The University of British Columbia, Vancouver, BC, Canada, V6T 1Z4. J. Huang is with the Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong, email: {bahramis, vincentw}@ece.ubc.ca, jwhuang@ie.cuhk.edu.hk

There have been some efforts in tackling the above challenges. We divide the related literature into two main threads. The first thread is concerned with techniques for scheduling the energy usage of the appliances in a household with a myopic user, who aims to minimize his cost in a short period of time (e.g., one day). Samadi et al.
in [3] proposed pricing algorithms based on stochastic approximation to minimize the PAR of the aggregate load in one day for a single household. Chen et al. in [4] proposed a robust optimization approach to minimize the worst-case daily bill payment of a myopic user in a market with the RTP scheme. Eksin et al. in [5] captured the interactions among myopic users with heterogeneous but correlated consumption preferences with the RTP scheme as a Bayesian game. Forouzandehmehr et al. in [6] proposed a differential stochastic game framework to capture the interactions among myopic users with controllable appliances. These works, however, did not mention how the proposed scheduling algorithms can be used for foresighted users, who aim to minimize their long-term costs.

The second thread is concerned with techniques for scheduling the appliances in a household with a foresighted user. Wen et al. in [7] proposed a reinforcement learning algorithm to address the appliances scheduling problem in a household. Kim et al. in [8] proposed a load scheduling algorithm based on Q-learning for a microgrid with a time-of-use pricing scheme. Liang et al. in [9] proposed a Q-learning approach to minimize the bill payment and discomfort cost of a foresighted user in a household. Ruelens et al. in [10] proposed a batch reinforcement learning algorithm to schedule controllable loads such as water heater and heat-pump thermostat. These works, however, did not mention how the proposed learning algorithms can capture the decision making of multiple foresighted users. Xiao et al. in [11] applied dynamic programming to model the interactions among multiple foresighted suppliers. Yao et al. in [12] studied the electricity sharing problem among multiple foresighted users with the RTP scheme. The scheduling problem of each individual user is formulated as a Markov decision process. A specific structure for the suboptimal policy of each user is determined. Jia et al. in [13] proposed a

learning algorithm based on stochastic approximation for the utility company to determine the day-ahead price values in a market with multiple foresighted users. These works, however, did not study the operation constraints of different electrical appliances in residential sectors.

In this paper, we focus on designing a load scheduling learning (LSL) algorithm for multiple residential users, who schedule their appliances in response to RTP information. Each user is aware that the total energy consumption (not just his own) will affect the price announced by the utility company. Furthermore, each user is selfish and aims to minimize his own bill payment. We study the long-term interactions among foresighted users instead of the short-term interactions among myopic users. This enables us to model the users' decision making with uncertainty about the price information and load demand of their appliances as a Markov decision process with different states for different possible scenarios.

We capture the interactions among users as a stochastic game [14]. Markov perfect equilibrium (MPE) is a standard solution concept for analyzing stochastic games. Several algorithms have been proposed to determine an MPE in fully observable stochastic games [15]-[22]. Some algorithms are model-based and require knowledge of the dynamics of the system, i.e., the state transition probabilities. The model-based learning algorithms include rational learning methods [15]-[17], linear programming based algorithms [18], [19], and the homotopy method [20]. Some other learning algorithms are model-free and aim to determine an MPE when the system dynamics are unknown. Examples of model-free approaches include the Lyapunov optimization method [21] and reinforcement learning algorithms [22]. In the demand response program, the underlying game is partially observable [23]-[25], since each user only observes his own state and is uncertain about other users' states. The key challenge in our model is to characterize the MPE under the partial observability of each user and the interdependency among the users' policies.
This paper is an extension of our previous work [26] that takes into account the uncertainty in the energy price and users' load demand. The contributions of this paper are as follows:

Novel Solution Approach: The partially observable stochastic game is a realistic framework to model the interactions among users, but it is difficult to solve. To make the problem tractable, we propose an algorithm executed by each user to approximate the state of all users using some additional information from the utility company. It enables us to approximate the users' optimal policy by the MPE policy in a fully observable stochastic game with incomplete information, which is more tractable.

Learning Algorithm Design: We formulate an individual optimization problem for each household, whose global optimal solution corresponds to the MPE policy of the proposed fully observable stochastic game with incomplete information. We develop a distributed LSL algorithm based on the actor-critic method [27]-[30] that converges to the MPE policy. The algorithm is online and model-free, which enables users to learn from the consequences of their past decisions and schedule their appliances in an online fashion without knowing the system dynamics.

Performance Evaluation: We evaluate the performance of the LSL algorithm in reducing the PAR in the aggregate load and the expected cost of users. Compared with the benchmark of not performing demand response, our results show that the LSL algorithm can reduce the PAR in the aggregate load and the expected cost of foresighted users by 13% and 28%, respectively. We compare the policy of the foresighted and myopic users, and show that foresighted users can reduce their daily cost by 17%. When compared with the Q-learning method (e.g., in [7] and [8]), the LSL algorithm based on the actor-critic method converges faster to the MPE policy.

The rest of this paper is organized as follows. Section II introduces the system model.
In Section III, we model the interactions among users as a partially observable stochastic game and approximate it by a fully observable stochastic game with incomplete information. In Section IV, we develop a distributed learning algorithm to compute the MPE. In Section V, we evaluate the performance of the proposed algorithm through simulations. Section VI concludes the paper.

II. SYSTEM MODEL

We consider a system with one utility company and a set $\mathcal{N} = \{1, \ldots, N\}$ of $N$ households. Each household is equipped with an energy consumption controller (ECC) responsible for scheduling the appliances in that household. The ECC is connected to the utility company via a two-way communication network, which enables the exchange of the price information and the household's load demand. Users participate in the demand response program for a long period of time (e.g., several weeks). We divide the time into a set $\mathcal{T} = \{1, \ldots, T\}$ of $T$ equal time slots, e.g., 15 minutes per time slot. In this paper, we use ECC, household, and user interchangeably.

A. Appliances Model

Let $\mathcal{A}_i = \{1, \ldots, A_i\}$ denote the set of appliances in household $i \in \mathcal{N}$, where $A_i$ is the total number of appliances. In each time slot, an appliance is either awake or asleep, indicating whether it is ready to operate or not. We define the appliance's operation state as follows:

Definition 1 (Appliance Operation State): For household $i \in \mathcal{N}$, the operation state of appliance $a \in \mathcal{A}_i$ in time slot $t \in \mathcal{T}$ is a tuple $s_{a,i,t} = (r_{a,i,t}, q_{a,i,t}, \delta_{a,i,t})$, where $r_{a,i,t}$ is the number of remaining time slots to complete the current task, $q_{a,i,t}$ is the number of time slots for which the current task can be delayed, and $\delta_{a,i,t}$ is the number of time slots since the most recent time slot in which appliance $a$ became awake with a new task.

Fig. 1 shows the values of $r_{a,i,t}$, $q_{a,i,t}$, and $\delta_{a,i,t}$ for appliance $a \in \mathcal{A}_i$, which has a task that should be operated for three time slots with a maximum delay of three time slots.
When appliance $a$ becomes awake in time slot $t$, $r_{a,i,t}$ and $q_{a,i,t}$ are initialized based on the current task (e.g., here we have $r_{a,i,t} = q_{a,i,t} = 3$), and $\delta_{a,i,t}$ is set to 1. The value of $r_{a,i,t}$ decreases when appliance $a$ executes its task and becomes 0 when the appliance has completed its task and is asleep in time slot $t$. The value of $q_{a,i,t}$ remains unchanged when the task is executed, and decreases when the task is delayed. When $q_{a,i,t}$ is 0, the ECC cannot delay the appliance's task. The value of $\delta_{a,i,t}$ increases in each time slot and is reset to 1 when appliance $a$ becomes awake with a new task. The appliance may start a new task right after completing the current task. Thus, without becoming asleep, $r_{a,i,t}$ and $q_{a,i,t}$ are initialized based on the new task, and $\delta_{a,i,t}$ is set to 1.

Fig. 1. The values of $r_{a,i,t}$, $q_{a,i,t}$, and $\delta_{a,i,t}$ for appliance $a$, which should be operated for three time slots with a maximum delay of three time slots.

The ECC does not know when an appliance becomes awake ahead of time. Instead, it has a belief regarding $P_{a,i}(\delta_{a,i,t})$, the probability that the difference between two sequential wake-up times for appliance $a$ is $\delta_{a,i,t}$, for $\delta_{a,i,t} \ge 1$. Such a probability distribution can be estimated, for example, based on the awake history for appliance $a$. The ECC can approximate $P_{a,i}(\delta_{a,i,t})$ by the ratio of the events in which the difference between two consecutive wake-up times is $\delta_{a,i,t}$ in a given historical data record. Appliance $a$ may become awake in the next time slot (for a new task) if either appliance $a$ is asleep or it will complete the current task in the current time slot. In Appendix A, we show that given current time $t$, the probability $P_{a,i,t+1}$ that appliance $a \in \mathcal{A}_i$ becomes awake with a new task in the next time slot $t+1 \in \mathcal{T}$ is

$$P_{a,i,t+1} = \frac{P_{a,i}(\delta_{a,i,t})}{1 - \sum_{\Delta=1}^{\delta_{a,i,t}-1} P_{a,i}(\Delta)}. \tag{1}$$

We partition the set of appliances into must-run and controllable. Let $\mathcal{A}_i^{\mathrm{M}}$ denote the set of must-run appliances in household $i$. Examples of must-run appliances include lighting and TV. The ECC has no control over the operation of must-run appliances. On the other hand, the ECC can control the time of use for the controllable appliances. The set of controllable appliances in household $i$ can further be partitioned into two sets: the set $\mathcal{A}_i^{\mathrm{N}}$ of non-interruptible appliances, and the set $\mathcal{A}_i^{\mathrm{I}}$ of interruptible appliances.
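As an illustration of the wake-up belief and of (1), the following is a minimal Python sketch. The function names `estimate_gap_pmf` and `wake_up_probability` are hypothetical, the distribution is stored as a dictionary from gap length to probability, and returning 0 when the gap lies outside the observed support is an assumption of this sketch, not something stated in the paper.

```python
from collections import Counter


def estimate_gap_pmf(wake_times):
    """Empirical estimate of P_{a,i}(delta): the fraction of consecutive
    wake-up pairs in the history whose time difference equals delta."""
    gaps = [later - earlier for earlier, later in zip(wake_times, wake_times[1:])]
    counts = Counter(gaps)
    return {gap: n / len(gaps) for gap, n in counts.items()}


def wake_up_probability(gap_pmf, delta):
    """Eq. (1): probability of waking up in the next slot, given that
    delta slots have passed since the most recent wake-up."""
    numerator = gap_pmf.get(delta, 0.0)
    # Probability that the gap is at least delta (denominator of Eq. (1)).
    survival = 1.0 - sum(gap_pmf.get(d, 0.0) for d in range(1, delta))
    # Assumption of this sketch: outside the observed support, return 0.
    if numerator == 0.0 or not survival > 0.0:
        return 0.0
    return numerator / survival
```

For instance, a history of wake-up slots 1, 5, 9, 12, 16 gives gaps 4, 4, 3, 4, so the estimated distribution puts mass 0.75 on a gap of 4 and 0.25 on a gap of 3.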
Examples of non-interruptible appliances include washing machine and dish washer, and examples of interruptible appliances include air conditioner and electric vehicle (EV). The ECC may schedule a non-interruptible appliance during several consecutive time slots, but cannot interrupt its task. The ECC may delay or interrupt the operation of an interruptible appliance. Each time an appliance $a \in \mathcal{A}_i$ becomes awake, it sends information about its new task's specifications to the ECC.

Definition 2 (Task's Specifications): For an appliance $a \in \mathcal{A}_i$, the specifications of its task include the average power consumption $p^{\mathrm{avg}}_{a,i}$ to execute the task; the scheduling window $\mathcal{T}_{a,i} = [t^{\mathrm{s}}_{a,i}, t^{\mathrm{d}}_{a,i}]$, corresponding to a time interval which includes the earliest start time $t^{\mathrm{s}}_{a,i} \in \mathcal{T}$ and the deadline $t^{\mathrm{d}}_{a,i} \in \mathcal{T}$ for the task; the operation duration $d_{a,i}$ for a must-run or non-interruptible appliance, corresponding to the total number of time slots required to complete the task; and the interval $[d^{\min}_{a,i}, d^{\max}_{a,i}]$ for an interruptible appliance, corresponding to the range of the operation duration.

The value of the average power consumption $p^{\mathrm{avg}}_{a,i}$ is assumed to be fixed and known a priori for each appliance $a$. The operation duration $d_{a,i}$ for a non-interruptible appliance $a \in \mathcal{A}_i^{\mathrm{N}}$ is fixed. On the other hand, the operation duration $d_{a,i}$ for a task of an interruptible appliance $a \in \mathcal{A}_i^{\mathrm{I}}$ can be any value in the range $[d^{\min}_{a,i}, d^{\max}_{a,i}]$, and we have $d^{\min}_{a,i} \ge 0$ and $d^{\max}_{a,i} \le t^{\mathrm{d}}_{a,i} - t^{\mathrm{s}}_{a,i}$.

We use the binary decision variable $x_{a,i,t} \in \{0, 1\}$ to indicate whether an appliance $a \in \mathcal{A}_i$ is scheduled to operate in time slot $t$ ($x_{a,i,t} = 1$) or not ($x_{a,i,t} = 0$). Notice that $x_{a,i,t}$ is equal to 0 when appliance $a$ is asleep (i.e., $r_{a,i,t} = 0$). Let $x_{i,t} = (x_{a,i,t}, a \in \mathcal{A}_i)$ denote the scheduling decision vector for all appliances in household $i$ in time slot $t$.
The ECC can infer the state $s_{a,i,t+1}$ of appliance $a$ in the next time slot $t+1$ from the current state $s_{a,i,t}$, the probability $P_{a,i,t+1}$, the appliance's type, the task's specifications, and the scheduling decision $x_{a,i,t}$ as follows:

1) Must-run appliances: The feasible action for appliance $a \in \mathcal{A}_i^{\mathrm{M}}$ in time slot $t \in \mathcal{T}$ is

$$x_{a,i,t} = \begin{cases} 1, & \text{if } r_{a,i,t} \ge 1, \\ 0, & \text{if } r_{a,i,t} = 0. \end{cases} \tag{2}$$

When appliance $a \in \mathcal{A}_i^{\mathrm{M}}$ becomes awake with a new task, $r_{a,i,t}$ is set to $d_{a,i}$, and the ECC operates the appliance without delay, i.e., $q_{a,i,t}$ is equal to 0. Given current time $t$, the operation state in time slot $t+1$ can be obtained as follows:

If either appliance $a \in \mathcal{A}_i^{\mathrm{M}}$ is asleep (i.e., $r_{a,i,t} = 0$) or it will complete its task in the current time slot (i.e., $r_{a,i,t} = 1$), then appliance $a$ becomes awake in time slot $t+1$ with probability $P_{a,i,t+1}$, with the corresponding next state

$$s_{a,i,t+1} = (d_{a,i}, 0, 1), \tag{3}$$

and the appliance is asleep in time slot $t+1$ with probability $1 - P_{a,i,t+1}$, with the corresponding next state

$$s_{a,i,t+1} = (0, 0, \delta_{a,i,t} + 1). \tag{4}$$

If $r_{a,i,t} \ge 2$, then appliance $a \in \mathcal{A}_i^{\mathrm{M}}$ has not completed its task yet. With probability 1, the corresponding next state is

$$s_{a,i,t+1} = (r_{a,i,t} - 1, 0, \delta_{a,i,t} + 1). \tag{5}$$

2) Non-interruptible controllable appliances: The feasible action for appliance $a \in \mathcal{A}_i^{\mathrm{N}}$ in time slot $t \in \mathcal{T}$ is

$$x_{a,i,t} = \begin{cases} 0 \text{ or } 1, & \text{if } t \in \mathcal{T}_{a,i},\ r_{a,i,t} \ge 1,\ q_{a,i,t} \ge 1, \\ 1, & \text{if } t \in \mathcal{T}_{a,i},\ r_{a,i,t} \ge 1,\ q_{a,i,t} = 0, \\ 0, & \text{if } r_{a,i,t} = 0. \end{cases} \tag{6}$$

Equation (6) implies that the ECC can decide to operate a non-interruptible appliance $a$ or not when the appliance is awake

($r_{a,i,t} \ge 1$) and its current task can be delayed ($q_{a,i,t} \ge 1$). The ECC has to operate an awake appliance if the task cannot be delayed ($q_{a,i,t} = 0$). The ECC will not schedule appliance $a$ if it is asleep ($r_{a,i,t} = 0$). When appliance $a \in \mathcal{A}_i^{\mathrm{N}}$ becomes awake, $r_{a,i,t}$ and $q_{a,i,t}$ are set to $d_{a,i}$ and $t^{\mathrm{d}}_{a,i} - t^{\mathrm{s}}_{a,i} - d_{a,i} + 1$, respectively. Given current time $t$, the operation state in the next time slot is as follows:

If either appliance $a \in \mathcal{A}_i^{\mathrm{N}}$ is asleep (i.e., $r_{a,i,t} = 0$) or it will complete the current task in the current time slot (i.e., $r_{a,i,t} = 1$ and $x_{a,i,t} = 1$), then the appliance becomes awake in time slot $t+1$ with probability $P_{a,i,t+1}$, with the corresponding next state

$$s_{a,i,t+1} = (d_{a,i},\ t^{\mathrm{d}}_{a,i} - t^{\mathrm{s}}_{a,i} - d_{a,i} + 1,\ 1), \tag{7}$$

and the appliance is asleep in time slot $t+1$ with probability $1 - P_{a,i,t+1}$, with the corresponding next state

$$s_{a,i,t+1} = (0, 0, \delta_{a,i,t} + 1). \tag{8}$$

If $r_{a,i,t} \ge 2$ and $x_{a,i,t} = 1$, then appliance $a \in \mathcal{A}_i^{\mathrm{N}}$ has not completed its task yet and is scheduled in the current time slot $t$. The appliance cannot be delayed in the next time slot, i.e., $q_{a,i,t+1} = 0$. With probability 1, the corresponding next state is

$$s_{a,i,t+1} = (r_{a,i,t} - 1, 0, \delta_{a,i,t} + 1). \tag{9}$$

If $r_{a,i,t} \ge 1$ and $x_{a,i,t} = 0$, then appliance $a \in \mathcal{A}_i^{\mathrm{N}}$ has not completed its task yet and is not scheduled in the current time slot $t$. With probability 1, we have

$$s_{a,i,t+1} = (r_{a,i,t},\ q_{a,i,t} - 1,\ \delta_{a,i,t} + 1).$$

The action set in (6) implies that $x_{a,i,t}$ cannot be equal to 0 if $q_{a,i,t}$ is 0 in time slot $t$.

3) Interruptible controllable appliances: Equation (6) is the feasible action for appliance $a \in \mathcal{A}_i^{\mathrm{I}}$ in time slot $t \in \mathcal{T}$. When an interruptible appliance $a \in \mathcal{A}_i^{\mathrm{I}}$ becomes awake with a new task, $r_{a,i,t}$ is set to the maximum operation duration $d^{\max}_{a,i}$. To operate the appliance for at least $d^{\min}_{a,i}$ time slots, the ECC can delay the task for at most $t^{\mathrm{d}}_{a,i} - t^{\mathrm{s}}_{a,i} - d^{\min}_{a,i} + 1$ time slots. The maximum operation duration may not be completed before the deadline within the scheduling horizon $\mathcal{T}_{a,i}$. In this case, if $t+1 \notin \mathcal{T}_{a,i}$, the interruptible appliance will become either asleep or awake with a new task in the next time slot $t+1$.
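As an illustration of the non-interruptible appliance model, the feasible action set (6) and the transitions (7)-(9) can be sketched in Python as follows. The names `OpState`, `feasible_actions`, and `next_states` are hypothetical. The sketch assumes that an awake appliance is always inside its scheduling window, so the condition $t \in \mathcal{T}_{a,i}$ in (6) is not checked explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpState:
    r: int      # remaining time slots of the current task
    q: int      # time slots by which the task can still be delayed
    delta: int  # time slots since the most recent wake-up


def feasible_actions(state):
    """Eq. (6), assuming an awake appliance is inside its scheduling window."""
    if state.r == 0:
        return (0,)    # asleep: nothing to schedule
    if state.q == 0:
        return (1,)    # awake and no slack left: must operate
    return (0, 1)      # awake with slack: operate or delay


def next_states(state, x, duration, t_start, t_deadline, p_wake):
    """Return a list of (probability, next state) pairs following (7)-(9)
    and the unnumbered delay transition."""
    if x not in feasible_actions(state):
        raise ValueError("infeasible action for this state")
    finishing = state.r == 1 and x == 1
    if state.r == 0 or finishing:
        fresh = OpState(duration, t_deadline - t_start - duration + 1, 1)  # (7)
        asleep = OpState(0, 0, state.delta + 1)                            # (8)
        return [(p_wake, fresh), (1.0 - p_wake, asleep)]
    if x == 1:
        return [(1.0, OpState(state.r - 1, 0, state.delta + 1))]          # (9)
    return [(1.0, OpState(state.r, state.q - 1, state.delta + 1))]
```

With a three-slot task and a window of six slots (so the initial slack is three, as in the Fig. 1 example), delaying from state (3, 3, 1) leads to (3, 2, 2), whereas operating leads to (2, 0, 2).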
The operation state in the next time slot $t+1$ is as follows:

If the next time slot is not in the scheduling window (i.e., $t+1 \notin \mathcal{T}_{a,i}$), appliance $a \in \mathcal{A}_i^{\mathrm{I}}$ is asleep (i.e., $r_{a,i,t} = 0$), or the appliance will complete its task in the current time slot (i.e., $r_{a,i,t} = 1$ and $x_{a,i,t} = 1$), then the appliance becomes awake in time slot $t+1$ with probability $P_{a,i,t+1}$, with the next state

$$s_{a,i,t+1} = (d^{\max}_{a,i},\ t^{\mathrm{d}}_{a,i} - t^{\mathrm{s}}_{a,i} - d^{\min}_{a,i} + 1,\ 1), \tag{10}$$

and the appliance is asleep in time slot $t+1$ with probability $1 - P_{a,i,t+1}$, with the corresponding next state

$$s_{a,i,t+1} = (0, 0, \delta_{a,i,t} + 1). \tag{11}$$

If the next time slot is in the scheduling window (i.e., $t+1 \in \mathcal{T}_{a,i}$), $r_{a,i,t} \ge 2$, and $x_{a,i,t} = 1$, then appliance $a \in \mathcal{A}_i^{\mathrm{I}}$ is scheduled in the current time slot $t$. The appliance is awake in the next time slot $t+1$ with probability 1, and the next state is

$$s_{a,i,t+1} = (r_{a,i,t} - 1,\ q_{a,i,t},\ \delta_{a,i,t} + 1). \tag{12}$$

If $t+1 \in \mathcal{T}_{a,i}$, $r_{a,i,t} \ge 1$, and $x_{a,i,t} = 0$, then the task of appliance $a \in \mathcal{A}_i^{\mathrm{I}}$ is not scheduled in the current time slot $t$. The appliance is awake in the next time slot $t+1$ with probability 1, with the corresponding next state

$$s_{a,i,t+1} = (r_{a,i,t},\ q_{a,i,t} - 1,\ \delta_{a,i,t} + 1). \tag{13}$$

B. Pricing Scheme and Household's Cost

In a dynamic pricing scheme, the payment by each household depends on the time and total amount of energy consumption. Let $l_{i,t} = \sum_{a \in \mathcal{A}_i} p^{\mathrm{avg}}_{a,i}\, x_{a,i,t}$ denote the aggregate load of household $i$ in time slot $t$. Let $l_t^{\mathrm{others}}$ denote the aggregate background load demand in time slot $t$ of other users that do not participate in the demand response program. The utility company knows $l_t^{\mathrm{others}}$ at the end of time slot $t$. Let $l_t = l_t^{\mathrm{others}} + \sum_{i \in \mathcal{N}} l_{i,t}$ denote the aggregate load demand of all users in time slot $t$. We assume that the utility company uses a combination of RTP and IBR [3], [31]. In time slot $t \in \mathcal{T}$, the unit price $\lambda_t$ is

$$\lambda_t(l_t) = \begin{cases} \lambda_{1,t}, & \text{if } 0 \le l_t \le l_t^{\mathrm{th}}, \\ \lambda_{2,t}, & \text{if } l_t > l_t^{\mathrm{th}}, \end{cases} \tag{14}$$

where $\lambda_{1,t} \le \lambda_{2,t}$, $t \in \mathcal{T}$. Here, $\lambda_{1,t}$ and $\lambda_{2,t}$ are the unit price values in time slot $t$ when the aggregate load is lower and higher than the threshold $l_t^{\mathrm{th}}$, respectively.
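The household load and the two-tier price in (14) can be sketched in Python as follows. The function names and the numbers in the usage note are illustrative assumptions only; they are not values from the paper.

```python
def household_load(avg_power, schedule):
    """l_{i,t}: sum of p_avg * x over the appliances of one household."""
    return sum(p * x for p, x in zip(avg_power, schedule))


def unit_price(total_load, lam1, lam2, threshold):
    """Eq. (14): lam1 up to and including the threshold, lam2 above it."""
    if 0 > total_load:
        raise ValueError("aggregate load must be non-negative")
    if lam1 > lam2:
        raise ValueError("Eq. (14) requires lam1 not larger than lam2")
    return lam2 if total_load > threshold else lam1
```

For example, with hypothetical price parameters (0.1, 0.2, 5.0), an aggregate load of exactly 5.0 is still charged at 0.1 per unit, while any load above 5.0 is charged at 0.2.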
We define the vector of price parameters in time slot $t$ as $\boldsymbol{\lambda}_t = (\lambda_{1,t}, \lambda_{2,t}, l_t^{\mathrm{th}})$. The price parameters are set by the utility company according to different factors such as the time of the day, day of the week, wholesale market conditions, and the operation conditions of the power network. We can capture the price changes by making the following assumption:

Assumption 1: The price parameters are generated according to a hidden Markov model. In each hidden state, the price parameters are generated from a probability distribution which is unknown to the users [32], [33].

Assumption 1 is consistent with many realistic situations of price determination. For example, the price parameters $\boldsymbol{\lambda}_t$ may change periodically. In this case, the hidden states correspond to the time of the day, and the price parameters vector for each hidden state is fixed. In a more general model, a hidden state corresponds to the time of the day and the price parameters are chosen from a known probability distribution (e.g., a truncated normal distribution) in each hidden state. If this is the case, the probability distribution for each time slot can be estimated by examining the historical prices of the same time slot from many days [33]. In Section V, we compare the users' scheduling decisions when the utility company applies the periodic and random price parameters, respectively.

The payment of household $i$ in time slot $t$ is $l_{i,t}\, \lambda_t(l_t)$. When the ECC interrupts the operation of the interruptible appliances, the corresponding user will experience a discomfort

cost. When an interruptible appliance $a \in \mathcal{A}_i^{\mathrm{I}}$ becomes awake, it sends the user's desirable operation schedule $x^{\mathrm{des}}_{a,i,t}$ for all time slots $t \in \mathcal{T}_{a,i}$ and the coefficients $\omega_{a,i,t}$, $a \in \mathcal{A}_i^{\mathrm{I}}$, $t \in \mathcal{T}_{a,i}$ (measured in dollars) to the ECC to reflect the user's discomfort caused by any potential change of the operation schedule of interruptible appliance $a$. For each household $i$, we capture the discomfort cost from scheduling the interruptible appliances by the weighted Euclidean distance between the operation schedule with demand response and the desirable operation schedule, $\sum_{a \in \mathcal{A}_i^{\mathrm{I}}} \omega_{a,i,t}\, |x_{a,i,t} - x^{\mathrm{des}}_{a,i,t}|$, which is also used in [34]. The total cost for each household $i$ in time slot $t$ involves the payment and discomfort cost. That is,

$$c_{i,t}(l_t) = l_{i,t}\, \lambda_t(l_t) + \sum_{a \in \mathcal{A}_i^{\mathrm{I}}} \omega_{a,i,t}\, \bigl|x_{a,i,t} - x^{\mathrm{des}}_{a,i,t}\bigr|. \tag{15}$$

In the long-term scheduling problem, the scheduling horizon $T$ is a large number (e.g., if the scheduling horizon is six months and each time slot is 15 minutes, then we have $T \approx 17000$). Thus, it is reasonable to approximate the problem with an infinite scheduling horizon, and consider the expected discounted cost of each household $i$ with the discount factor $\beta$ [35, p. 150] as

$$(1 - \beta) \sum_{t=1}^{\infty} \beta^{t-1}\, c_{i,t}(l_{i,t}, l_{-i,t}). \tag{16}$$

The parameter $\beta$ in (16) can be used to characterize a wide range of users' behaviour. When $\beta$ is close to zero, the users are myopic, i.e., they aim to minimize their short-term cost (e.g., daily cost) without considering the consequences of their short-term policy on their future cost. When $\beta$ is close to one, the users are foresighted, i.e., they aim to minimize their long-term cost. One may assume different values of $\beta$ for different participating users. In this paper, we assume that all users have the same value of $\beta$; the case where different users have different values of $\beta$ is left for a more general future study. In the cost model (16) with an infinite scheduling horizon, we can consider stationary scheduling decision making that is independent of time.
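To make the cost model concrete, the per-slot cost (15) and a finite-horizon truncation of the normalized discounted sum (16) can be sketched in Python as follows. The function names are hypothetical, and truncating the infinite sum to a finite list of sampled per-slot costs is an assumption made only to keep the sketch runnable.

```python
def slot_cost(load, price, weights, schedule, desired):
    """Eq. (15): payment plus weighted deviation from the desired schedule
    of the interruptible appliances."""
    discomfort = sum(w * abs(x - x_des)
                     for w, x, x_des in zip(weights, schedule, desired))
    return load * price + discomfort


def discounted_cost(costs, beta):
    """Truncated version of (16): (1 - beta) * sum_t beta^(t-1) * c_t,
    where costs[0] is the cost in time slot t = 1."""
    return (1.0 - beta) * sum(beta ** k * c for k, c in enumerate(costs))
```

The factor $(1 - \beta)$ normalizes the sum, so a constant per-slot cost $c$ yields a discounted cost that approaches $c$ as the horizon grows. With $\beta$ near zero the first slot dominates (the myopic case), and with $\beta$ near one later slots carry almost the same weight (the foresighted case).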
Specifically, the decision making only depends on the price parameters and the appliance operation state in a time slot, but is independent of the time slot index $t$. Therefore, we can remove the time index $t$ from the appliances' states, price parameters, and the household's cost.

III. PROBLEM FORMULATION

Due to privacy concerns, each household does not reveal the information about its appliances to other households. We have

Assumption 2: The ECC can only observe the operation state of the appliances in its own household.

We capture the interactions among households in the demand response program as a partially observable stochastic game.

Game 1 (Households' Partially Observable Stochastic Game):

Players: The set of households $\mathcal{N}$.

States: The state of household $i$ is $s_i = (s_{a,i}, a \in \mathcal{A}_i)$.

Observations: The observation of household $i$ is $o_i = (s_i, \boldsymbol{\lambda}) \in \mathcal{O}_i$, where $\mathcal{O}_i$ is the set of possible observations for household $i$. Let $o = (o_i, i \in \mathcal{N}) \in \mathcal{O}$ denote the observation profile of all households, where $\mathcal{O} = \prod_{i \in \mathcal{N}} \mathcal{O}_i$. We use notations $z(o_i)$ and $z(o)$ to denote the value of an arbitrary parameter $z$ in observation $o_i$ of household $i$ and in the observation profile $o$ of all households, respectively.

Actions: We define the action vector of household $i$ in observation profile $o$ as $x_i(o) = (x_{a,i}(o), a \in \mathcal{A}_i)$. Let $x(o) = (x_i(o), i \in \mathcal{N})$ denote the action profile of all households. Let $\mathcal{X}_i(o_i)$ denote the feasible action space obtained from (2) and (6) for household $i$ with observation $o_i$.

Transition Probabilities: Given the current price parameters, Assumption 1 implies that the price parameters vector is Markovian. From Section II-A, the next state of an appliance depends only on its current state and action. Thus, the transition between the observations of a household is Markovian. Let $P_i(o_i' \mid o_i, x_i(o))$ denote the transition probability from observation $o_i \in \mathcal{O}_i$ to $o_i' \in \mathcal{O}_i$ with action $x_i(o)$. It depends on the appliances' wake-up probability in (1). Furthermore, the users have independent preferred plans of using their appliances. Hence, the states of different households are independent.
The transition probability from observation profile $o \in \mathcal{O}$ to $o' \in \mathcal{O}$ with action profile $x(o)$ is $P(o' \mid o, x(o)) = \prod_{i \in \mathcal{N}} P_i(o_i' \mid o_i, x_i(o))$.

Stationary Policies: Let $\pi_i(o, x_i(o))$ denote the probability of choosing a feasible action $x_i(o)$ in observation profile $o$. Let $\pi_i(o) = (\pi_i(o, x_i(o)), x_i(o) \in \mathcal{X}_i(o_i))$ denote the probability distribution over the feasible actions. We define the stationary policy for household $i$ as the vector $\pi_i = (\pi_i(o), o \in \mathcal{O})$. Let $\pi = (\pi_i, i \in \mathcal{N})$ denote the joint policy of all households, and $\pi_{-i}$ denote the policy for all households except household $i$.

Value functions: Under a given joint policy $\pi$, the value function $V_i^{\pi} : \mathcal{O} \to \mathbb{R}$ returns the expected discounted cost for household $i$ starting with observation profile $o$. It can be expressed as the following Bellman equation [14]:

$$V_i^{\pi}(o) = \mathbb{E}_{\pi_i(o)}\bigl\{ Q_i^{\pi_{-i}}\bigl(o, x_i(o)\bigr) \bigr\}, \quad \forall\, o \in \mathcal{O}, \tag{17}$$

where $\mathbb{E}_{\pi_i(o)}\{\cdot\}$ denotes the expectation over the probability distribution $\pi_i(o)$. Function $Q_i^{\pi_{-i}}\bigl(o, x_i(o)\bigr)$ is the Q-function for household $i$ with action $x_i(o)$ in observation profile $o$ when other households' policy is $\pi_{-i}$ [14]. We have

$$Q_i^{\pi_{-i}}\bigl(o, x_i(o)\bigr) = \mathbb{E}_{\pi_{-i}(o)}\Bigl\{ (1 - \beta)\, c_i(o, x(o)) + \beta \sum_{o' \in \mathcal{O}} P(o' \mid o, x(o))\, V_i^{\pi}(o') \Bigr\}. \tag{18}$$

It is computationally difficult to determine the optimal policies for the households in such a partially observable stochastic game. In a partially observable stochastic game among users, each user needs to know what other users are observing in each time slot. Inspired by the works in [23]-[25], we propose an algorithm executed by each ECC to estimate the observation profile of all households. It enables us to study the users' optimal policy in a fully observable stochastic game with incomplete information, in which the households play a sequence of Bayesian games.
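A minimal sketch of this estimation step (Algorithm 1 in Section III-A) is given below in Python. It assumes that the "average load demand for all feasible actions" is a uniform average of the household load over its feasible action vectors, which is one reading of the text and not confirmed by it. The function names are hypothetical, and the utility company's averaging is simulated locally.

```python
def average_feasible_load(avg_power, feasible_action_vectors):
    """l_i^avg(o_i): household load averaged uniformly over the feasible
    action vectors (uniform weighting is an assumption of this sketch)."""
    loads = [sum(p * x for p, x in zip(avg_power, action))
             for action in feasible_action_vectors]
    return sum(loads) / len(loads)


def approximate_observation_profile(reported_avg_loads, price_params):
    """The utility company averages the households' reports, and each ECC
    forms the approximate observation profile (l_avg, lambda)."""
    l_avg = sum(reported_avg_loads) / len(reported_avg_loads)
    return (l_avg, tuple(price_params))
```

Because every ECC receives the same average and the same price parameters, all households end up with an identical approximate profile, while individual appliance states are never exchanged.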

Algorithm 1 Executed by ECC $i \in \mathcal{N}$.
1: Communicate the average load demand $l_i^{\mathrm{avg}}(o_i)$ for all feasible actions $x_i(o_i) \in \mathcal{X}_i(o_i)$ to the utility company.
2: Receive the average aggregate load $l^{\mathrm{avg}}(o)$ from the utility company.
3: Approximate the observation profile by $\hat{o} := (l^{\mathrm{avg}}(o), \boldsymbol{\lambda})$.

A. Observation Profile Approximation Algorithm

To make the analysis of Game 1 tractable, we propose an algorithm executed by each ECC to approximate the observation of all households using some additional information. Let $\hat{o}$ denote the approximate observation profile of all households. Algorithm 1 describes how ECC $i$ obtains $\hat{o}$. ECC $i$ sends the average load demand $l_i^{\mathrm{avg}}(o_i)$ of all feasible actions $x_i(o_i) \in \mathcal{X}_i(o_i)$ to the utility company. ECC $i$ knows $\boldsymbol{\lambda}$ and receives the average aggregate load $l^{\mathrm{avg}}(o) = \frac{1}{N} \sum_{j \in \mathcal{N}} l_j^{\mathrm{avg}}(o_j)$. It approximates the observation profile $o$ by the vector $\hat{o} = (l^{\mathrm{avg}}(o), \boldsymbol{\lambda})$.

In Algorithm 1, each household receives information on the average aggregate load demand. Thus, the privacy of each individual household is protected. All ECCs obtain the same approximation for an observation profile. Thus, we can consider a fully observable stochastic game with incomplete information. Under a given approximate observation profile $\hat{o}$, the households play a Bayesian game, as each household $i$ may have different observations $o_i$, and thus different sets of feasible actions.

Game 2 (Households' Fully Observable Stochastic Game with Incomplete Information): This game is constructed from Game 1 if the households define their actions and policy as follows:

Actions: Let $\mathcal{O}_i(\hat{o}) \subseteq \mathcal{O}_i$ denote the set of possible observations for household $i$ in the approximate observation profile $\hat{o}$. We define the set of actions for household $i$ in the approximate observation profile $\hat{o}$ as $\hat{\mathcal{X}}_i(\hat{o}) = \{x_i(o_i) : x_i(o_i) \in \mathcal{X}_i(o_i),\ o_i \in \mathcal{O}_i(\hat{o})\}$. The feasibility of an action $x_i(\hat{o}) \in \hat{\mathcal{X}}_i(\hat{o})$ depends on the observation $o_i$ of household $i$.

Policies: We define the stationary policy $\pi_i(\hat{o}, x_i(o_i))$ as the probability of choosing a feasible action $x_i(o_i) \in \mathcal{X}_i(o_i)$ in an approximate observation profile $\hat{o}$ when the observation of household $i$ is $o_i \in \mathcal{O}_i(\hat{o})$.
Let P (o ô) be the probablty that household has observaton o O (ô) when the approxmate observaton profle s ô. Hence, the probablty of choosng any acton x (ô) ˆX (ô) s π (ô, x (ô)) = P (o ô)π (ô, x (o )). Let π (ô) = (π (ô, x (ô)), x (ô) ˆX (ô)) denote the probablty dstrbuton over the actons for household n an approxmate observaton profle ô. We defne the polcy for household n Game 2 as the vector π = (π (ô), ô O). B. Markov Perfect Equlbrum (MPE) Polcy In ths subsecton, we dscuss how each household determnes a polcy π (ô) n Game 2 for any approxmate observaton profles ô to mnmze ts value functon V π(ô). The MPE s a standard soluton concept for the partally observable stochastc games. The MPE corresponds to the users polces wth Markov propertes and s compatble wth the assumpton for the applance model n Secton II-A. The MPE n Game 2 s defned as follows: Defnton 3 A polcy π MPE = (π MPE, N ) s an MPE f for every household N wth a polcy π, we have V (π,πmpe ),π MPE ) (ô) V (πmpe (ô), N, ô O. (19) The MPE polcy s the fxed pont soluton of every household s best response polcy. Household solves the followng Bellman equatons when other households polces are fxed: V πmpe (ô) = mnmze E π(ô) π (ô) {Q πmpe } (ô, x (ô)), ô O. (20) As the followng Theorem states, the exstence of the MPE s guaranteed for Game 2. Theorem 1 Game 2 has at least one MPE n stochastc statonary polces. The proof of Theorem 1 can be found n Appendx B. The MPE s the fxed pont of N recursve problems n (20) for all households. Problem (20) mples that for household wth acton x (ô) under observaton profle ô n the MPE, we have V πmpe (ô) Q πmpe (ô, x (ô)). We ntroduce an equvalent non-recursve optmzaton problem for each household, whch s more tractable. For household N, we defne the Bellman error [14] for an acton x (ô) n an approxmate observaton profle ô as B (V π, ô, x (ô)) = Q π (ô, x (ô)) V π (ô). (21) We defne functon (V π Bellman errors for all observatons ô O. 
That is,

$$f_i^{\mathrm{obj}}(V_i^{\pi}, \pi_i) = \sum_{\hat{o} \in \mathcal{O}} \mathbb{E}_{\pi_i(\hat{o})}\left\{ B_i(V_i^{\pi}, \hat{o}, x_i(\hat{o})) \right\}. \tag{22}$$

Each household $i$ aims to determine the policy $\pi_i$ and the value function $V_i^{\pi}$ to minimize $f_i^{\mathrm{obj}}(V_i^{\pi}, \pi_i)$ by solving the following optimization problem:

$$\underset{V_i^{\pi},\, \pi_i}{\text{minimize}}\ f_i^{\mathrm{obj}}(V_i^{\pi}, \pi_i) \tag{23}$$
$$\text{subject to}\ B_i(V_i^{\pi}, \hat{o}, x_i(\hat{o})) \ge 0, \quad \forall\, \hat{o} \in \mathcal{O},\ x_i(\hat{o}) \in \hat{\mathcal{X}}_i(\hat{o}).$$

Problem (23) is generally a non-convex problem, and may have several local minima. We show that the MPE policy of household $i$ is the global minimum of problem (23).

Theorem 2 The policy $\pi^{\mathrm{MPE}}$ is an MPE of Game 2 if and only if for all households $i \in \mathcal{N}$ with action $x_i(\hat{o}) \in \hat{\mathcal{X}}_i(\hat{o})$, we have

$$\pi_i^{\mathrm{MPE}}(\hat{o}, x_i(\hat{o}))\, B_i\left(V_i^{\pi^{\mathrm{MPE}}}, \hat{o}, x_i(\hat{o})\right) = 0, \quad \forall\, \hat{o} \in \mathcal{O}. \tag{24}$$

The proof can be found in Appendix C. Theorem 2 implies that the Bellman error is zero for an action with positive probability at the MPE. Thus, $f_i^{\mathrm{obj}}(V_i^{\pi^{\mathrm{MPE}}}, \pi_i^{\mathrm{MPE}}) = 0$ and the MPE is the global optimal solution of problem (23) for all households.

Solving problem (23) is still challenging, as each ECC requires the values of the unavailable transition probabilities between the observations. This motivates us to develop a model-free learning algorithm that enables each ECC to schedule the appliances in an online manner without knowing the system dynamics. Basically, each ECC updates the policy and value function based on the consequences of its past decisions.

As part of the learning algorithm, we need to record the observation and action spaces for a household. In order to reduce the complexity, we use the linear function approximation to estimate the value function [36, Ch. 3]. For household $i$, let $\phi_i(\hat{o}) = (\phi_{v,i}(\hat{o}),\ v \in \mathcal{V})$ denote the row vector of basis functions, where $\mathcal{V}$ is the set of basis functions. Let $\theta_i = (\theta_{v,i},\ v \in \mathcal{V})$ denote the row vector of weight coefficients. The approximate value function for household $i$ is

$$V_i^{\pi}(\hat{o}, \theta_i) = \theta_i\, \phi_i^{T}(\hat{o}), \tag{25}$$

where $T$ is the transpose operator. It enables ECC $i$ to compute the vector $\theta_i$ with $|\mathcal{V}|$ elements instead of the value function $V_i^{\pi}(\hat{o})$ for all approximate observation profiles $\hat{o}$.

We parameterize the policy $\pi_i$ for household $i$ via softmax approximation [36, Ch. 3]. Let $\mu_i(\hat{o}, x_i(\hat{o})) = (\mu_{p,i}(\hat{o}, x_i(\hat{o})),\ p \in \mathcal{P})$ denote the row vector of basis functions, where $\mathcal{P}$ is the set of basis functions. Let $\vartheta_i = (\vartheta_{p,i},\ p \in \mathcal{P})$ denote the row vector of weight coefficients. The approximate probability of choosing action $x_i(\hat{o}) \in \hat{\mathcal{X}}_i(\hat{o})$ is

$$\pi_i(\hat{o}, x_i(\hat{o}), \vartheta_i) = \frac{e^{\vartheta_i \mu_i^{T}(\hat{o}, x_i(\hat{o}))}}{\sum_{x_i'(\hat{o}) \in \hat{\mathcal{X}}_i(\hat{o})} e^{\vartheta_i \mu_i^{T}(\hat{o}, x_i'(\hat{o}))}}. \tag{26}$$

To simplify the computation of this approximation, we use the vector of compatible basis functions $\psi_i(\hat{o}, x_i(\hat{o})) = (\psi_{p,i}(\hat{o}, x_i(\hat{o})),\ p \in \mathcal{P})$, where

$$\psi_{p,i}(\hat{o}, x_i(\hat{o})) = \frac{\partial \ln\left(\pi_i(\hat{o}, x_i(\hat{o}), \vartheta_i)\right)}{\partial \vartheta_{p,i}}. \tag{27}$$

We can show that for the softmax parameterized policy, the vector of basis functions $\mu_i(\hat{o}, x_i(\hat{o}))$ can be replaced with the vector $\psi_i(\hat{o}, x_i(\hat{o}))$ [30].

IV. ONLINE LEARNING ALGORITHM DESIGN

In this section, we propose a load scheduling learning (LSL) algorithm executed by the ECC of each household to determine the MPE policy. We use an actor-critic learning method, which is more robust than the actor-only methods (such as the policy evaluation [22, Ch. 2]) and faster than the critic-only methods (such as the Q-learning and temporal difference (TD) learning [22, Ch. 6]).
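Returning to the softmax parameterization in (26) and the compatible basis functions in (27), the following is a minimal Python sketch of how they could be computed for a single approximate observation profile. It is an illustration under stated assumptions, not the authors' implementation: the function names, the plain-list representation, and the toy feature values are all hypothetical. For a softmax policy, the derivative in (27) reduces to the feature vector of the action minus the policy-weighted mean feature vector, which is what `compatible_features` returns.

```python
import math

def softmax_policy(vartheta, features):
    """Action probabilities in the form of (26) for one observation profile.

    vartheta: policy weight vector (hypothetical plain list).
    features: one feature vector mu per feasible action (hypothetical values).
    """
    scores = [sum(w * f for w, f in zip(vartheta, mu)) for mu in features]
    shift = max(scores)  # subtracting the max score avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def compatible_features(vartheta, features):
    """Gradient of ln(pi) with respect to vartheta for each action, as in (27).

    For a softmax policy this is mu(x) minus the policy-weighted mean of mu.
    """
    probs = softmax_policy(vartheta, features)
    dim = len(vartheta)
    mean = [sum(p * mu[d] for p, mu in zip(probs, features)) for d in range(dim)]
    return [[mu[d] - mean[d] for d in range(dim)] for mu in features]

# Toy example (assumed values): three feasible actions, two basis functions.
vartheta = [0.5, -1.0]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = softmax_policy(vartheta, features)
psi = compatible_features(vartheta, features)
```

A useful sanity check is that the compatible features average to zero under the policy, since the probabilities always sum to one.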
The concept of the actor-critic was originally introduced by Witten in [27] and then elaborated by Barto et al. in [28]. A detailed study of the actor-critic algorithm can be found in [29], [30]. Our LSL algorithm is based on the first proposed algorithm in [30]. The ECC is responsible for the actor and critic updates. In the critic update, the ECC evaluates the policy to update the value function. In the actor update, it updates the policy to decrease the objective value of problem (23) based on the updated value function. In the policy update, we use the gradient method with a smaller step size compared with the step size in the value function's update, thereby using a two-timescale update process [30].

Algorithm 2 describes the LSL algorithm executed by ECC $i$. The index $k$ refers to both iteration and time slot. Our algorithm involves the initiation and scheduling phases. Line 1 describes the initialization in time slot $k = 1$. The loop involving Lines 2 to 14 describes the scheduling phase, which includes the observation profile approximation, the critic update, the actor update, and the basis function construction. In Line 3, ECC $i$ executes Algorithm 1 to obtain the approximate observation profile $\hat{o}$. In time slot $k = 1$, ECC $i$ does not have any experience from its past decisions and chooses an action in Line 11. For $k > 1$, the critic and actor updates are executed. ECC $i$ determines the updated vector $\theta_i^{k}$ using the TD approach [22, Ch. 6]. The TD error $e_{\mathrm{TD}}^{k-1}$ is

$$e_{\mathrm{TD}}^{k-1} = (1 - \beta)\, c_i\left(\hat{o}^{k-1}, x^{k-1}(\hat{o}^{k-1})\right) + \beta\, V_i^{\pi,k-1}(\hat{o}^{k}, \theta_i^{k-1}) - V_i^{\pi,k-1}(\hat{o}^{k-1}, \theta_i^{k-1}). \tag{28}$$

The critic update for ECC $i$ is

$$\theta_i^{k} = \theta_i^{k-1} + \gamma_c^{k-1}\, e_{\mathrm{TD}}^{k-1}\, \phi_i(\hat{o}^{k-1}), \tag{29}$$

where $\gamma_c^{k}$ is the critic step size in iteration $k$. In the actor update module, ECC $i$ determines the updated vector $\vartheta_i^{k}$ using the gradient method with a descent direction. In particular, ECC $i$ uses the descent direction $-\pi_i^{k-1}(\hat{o}^{k-1}, x_i^{k-1}(\hat{o}^{k-1}), \vartheta_i^{k-1})\, \nabla_{\vartheta_i^{k-1}} f_i^{\mathrm{obj}}(V_i^{\pi,k-1}, \pi_i^{k-1})$ to ensure convergence to the MPE. Since the gradient is not available, ECC $i$ uses the vector $e_{\mathrm{TD}}^{k-1}\, \psi_i(\hat{o}^{k-1}, x_i^{k-1}(\hat{o}^{k-1}))$ as an estimate of the gradient [30, Algorithm 1].
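The critic and actor steps can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' code: the function names, argument names, and toy numbers are hypothetical, the value function is the linear form in (25), and the actor step follows the form of the update stated in (30), with the TD error times the compatible feature vector standing in for the unavailable gradient.

```python
def dot(a, b):
    """Inner product of two equally sized lists."""
    return sum(x * y for x, y in zip(a, b))

def td_error(cost, v_prev, v_next, beta):
    """TD error in the form of (28)."""
    return (1.0 - beta) * cost + beta * v_next - v_prev

def actor_critic_step(theta, vartheta, phi_prev, phi_next, psi_prev,
                      action_prob, cost, beta, step_critic, step_actor):
    """One critic update as in (29) followed by one actor update as in (30).

    theta, vartheta: critic and actor weights from the previous iteration.
    phi_prev, phi_next: value-function features of the previous and current
        approximate observation profiles.
    psi_prev: compatible features of the action chosen in the previous slot.
    action_prob: probability the previous policy gave to that action.
    """
    v_prev = dot(theta, phi_prev)
    v_next = dot(theta, phi_next)
    e = td_error(cost, v_prev, v_next, beta)
    # Critic: move the value weights along the TD error.
    new_theta = [t + step_critic * e * f for t, f in zip(theta, phi_prev)]
    # Actor: step against the estimated gradient e * psi, scaled by the
    # probability of the chosen action.
    new_vartheta = [w - step_actor * action_prob * e * p
                    for w, p in zip(vartheta, psi_prev)]
    return new_theta, new_vartheta, e

# Toy numbers (assumed): two value features, one policy feature.
new_theta, new_vartheta, e = actor_critic_step(
    theta=[0.0, 0.0], vartheta=[0.0],
    phi_prev=[1.0, 0.0], phi_next=[0.0, 1.0], psi_prev=[0.5],
    action_prob=0.4, cost=2.0, beta=0.5,
    step_critic=0.1, step_actor=0.01)
```

A positive TD error means the observed cost was higher than the critic predicted, so the actor step lowers the weight on features that made the chosen action more likely.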
Therefore, the convergence to the MPE is guaranteed, since the TD error $e_{\mathrm{TD}}^{k-1}$ is an estimate for the Bellman error for action $x_i^{k-1}$ in iteration $k-1$. Thus, the descent direction is zero if condition (24) is satisfied. The actor update for ECC $i$ is

$$\vartheta_i^{k} = \vartheta_i^{k-1} - \gamma_a^{k}\, \pi_i^{k-1}(\hat{o}^{k-1}, x_i^{k-1}(\hat{o}^{k-1}), \vartheta_i^{k-1})\, e_{\mathrm{TD}}^{k-1}\, \psi_i(\hat{o}^{k-1}, x_i^{k-1}(\hat{o}^{k-1})), \tag{30}$$

where $\gamma_a^{k}$ is the actor step size in iteration $k$.

We use the approach in [37] to autonomously construct the new basis functions $\psi_{|\mathcal{P}|+1,i}(\hat{o}, x_i(\hat{o}))$ and $\phi_{|\mathcal{V}|+1,i}(\hat{o})$. The candidate for the basis function $\psi_{|\mathcal{P}|+1,i}(\hat{o}, x_i(\hat{o}))$ is the TD error $e_{\mathrm{TD}}^{k-1}$ in (28), which estimates the Bellman error. The expectation over the Bellman errors of the feasible actions $x_i(\hat{o}^{k-1}) \in \mathcal{X}_i(o_i^{k-1})$ is the candidate for $\phi_{|\mathcal{V}|+1,i}(\hat{o})$. We have

$$\psi_{|\mathcal{P}|+1,i}(\hat{o}, x_i(\hat{o})) = e_{\mathrm{TD}}^{k-1}, \tag{31}$$
$$\phi_{|\mathcal{V}|+1,i}(\hat{o}) = \mathbb{E}\left\{ B_i^{k-1}\left(V_i^{\pi,k-1}, \hat{o}^{k-1}, x_i(\hat{o}^{k-1})\right) \right\}. \tag{32}$$

The expectation in (32) is over the probability of choosing each feasible action $x_i(\hat{o}^{k-1}) \in \mathcal{X}_i(o_i^{k-1})$. In Appendix D, we explain how to approximate the Bellman error for each feasible action. In Line 7, ECC $i$ checks the convergence of $\theta_i^{k}$ and decides whether to add the new basis functions or not.

In Line 11, ECC $i$ schedules the appliances in the current time slot $k$. In Line 12, ECC $i$ receives the cost $c_i(\hat{o}^{k}, x^{k}(\hat{o}^{k}))$. The next time slot is started in Line 13. In Line 14, the stopping criterion is given. From Theorem 2, the LSL algorithm converges to the MPE if the objective value $f_i^{\mathrm{obj}}(V_i^{\pi,k-1}, \pi_i^{k-1})$ is zero. ECC $i$ computes the approximate objective value by summing over the expected Bellman errors up to iteration $k-1$ as

$$\hat{f}_i^{\mathrm{obj}}(V_i^{\pi,k-1}, \pi_i^{k-1}) = \sum_{j=1}^{k-1} \mathbb{E}_{\pi_i^{k-1}(\hat{o}^{j})}\left\{ B_i^{j}\left(V_i^{\pi,k-1}, \hat{o}^{j}, x_i(\hat{o}^{j})\right) \right\}. \tag{33}$$

Algorithm 2 LSL Algorithm Executed by ECC $i \in \mathcal{N}$.
1: Set $k := 1$, $\epsilon := 10^{-3}$, and $\xi =$ [...]. Set $\phi_{1,i}(\cdot) := 1$ and $\psi_{1,i}(\cdot) := 1$, and randomly initialize $\theta_{1,i}^{1}$ and $\vartheta_{1,i}^{1}$.
2: Repeat
3:   Observe $o_i^{k} := (s_i^{k}, \lambda^{k})$. Approximate $\hat{o}^{k}$ using Algorithm 1.
4:   If $k \neq 1$,
5:     Determine the updated vector $\theta_i^{k}$ according to (29).
6:     Determine the updated vector $\vartheta_i^{k}$ according to (30).
7:     If $\lVert \theta_i^{k} - \theta_i^{k-1} \rVert < \epsilon$,
8:       Construct new basis functions $\psi_{|\mathcal{P}|+1,i}(\hat{o}, x_i(\hat{o}))$ and $\phi_{|\mathcal{V}|+1,i}(\hat{o})$ using (31) and (32).
9:     End if
10:   End if
11:   Choose action $x_i^{k}(\hat{o}^{k})$ using policy $\pi_i^{k}(\hat{o}^{k}, \vartheta_i^{k})$.
12:   Receive the cost $c_i(\hat{o}^{k}, x^{k}(\hat{o}^{k}))$ from the utility company.
13:   $k := k + 1$.
14: Until $\hat{f}_i^{\mathrm{obj}}(V_i^{\pi,k-1}, \pi_i^{k-1}) < \xi$.

The sufficient conditions for the actor and critic step sizes to ensure the convergence of the LSL algorithm are given in [29]. In the proposed model-free LSL algorithm, ECC $i$ does not know the next states of the appliances until the next time slot begins in Line 13. The ECC updates its value function using the TD error in (28), which depends on the next time slot observation. Therefore, the ECC only goes through one iteration per time slot.

V. PERFORMANCE EVALUATION

In this section, we evaluate the performance of the LSL algorithm in a system where one utility company serves 200 households that participate in the demand response program. The scheduling horizon is six months. Each time slot is 15 minutes. We consider six controllable appliances for each household: the dish washer, washing machine, and stove are non-interruptible appliances, and the EV, air conditioner, and water heater are interruptible appliances. We model other appliances such as the refrigerator and TV as must-run appliances. Table I summarizes the task specifications of the controllable appliances [38]. For the EV in each household $i$, we have $d_{a,i}^{\min} = (B_i^{d} - B_i^{0})/p_{a,i}^{\mathrm{avg}}$ and $d_{a,i}^{\max} = (B_i^{\max} - B_i^{0})/p_{a,i}^{\mathrm{avg}}$, where $B_i^{0}$ is the initial charging level when the EV awakes, $B_i^{d}$ is the charging demand for the next trip, and $B_i^{\max}$ is the battery's maximum capacity.
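As a worked example of the EV duration bounds, the sketch below evaluates the two expressions for one hypothetical vehicle. The 3 kW average power, the 18 kWh demand, and the 30 kWh capacity are values used in this section's simulation setup; the initial level of 0 kWh and the function name are assumptions made only for illustration.

```python
def ev_duration_bounds(b_demand_kwh, b_initial_kwh, b_max_kwh, p_avg_kw,
                       slot_hours=0.25):
    """Minimum and maximum EV charging durations, in hours and in time slots.

    d_min = (B_d - B_0) / p_avg and d_max = (B_max - B_0) / p_avg.
    slot_hours defaults to the 15-minute slot used in the simulations.
    """
    d_min_hours = (b_demand_kwh - b_initial_kwh) / p_avg_kw
    d_max_hours = (b_max_kwh - b_initial_kwh) / p_avg_kw
    return {
        "d_min_hours": d_min_hours,
        "d_max_hours": d_max_hours,
        "d_min_slots": d_min_hours / slot_hours,
        "d_max_slots": d_max_hours / slot_hours,
    }

# Assumed initial level of 0 kWh; the other values come from the setup.
bounds = ev_duration_bounds(b_demand_kwh=18.0, b_initial_kwh=0.0,
                            b_max_kwh=30.0, p_avg_kw=3.0)
```

With these inputs the EV must charge for at least 6 hours (24 slots) and can charge for at most 10 hours (40 slots), which leaves room for charging beyond the current demand on a low-price day.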
The charging demand of the EV in household $i$ is uniformly chosen at random from the set {18 kWh, [...] kWh, ..., 24 kWh}. The battery capacity is set to 30 kWh. Typically, the user is indifferent among the charging patterns for the EV as long as the charging is finished before the deadline. Thus, we set the coefficients $\omega_{a,i,t}$, $t \in \mathcal{T}$, to zero for the EV. The coefficients $\omega_{a,i,t}$, $t \in \mathcal{T}$, are chosen uniformly at random from the interval [$0, $0.5] for the air conditioner and water heater. We set the desired load pattern $(x_{a,i,t}^{\mathrm{des}},\ t \in \mathcal{T})$ of the air conditioner to a 16-hour period, during which the appliance turns on for an hour and turns off in the next hour in a periodic fashion. We set the desired load pattern of the water heater to a 5-hour period without interruption. To simulate the non-interruptible appliances, we consider several scheduling windows selected uniformly between 10 am and 10 pm, with a length that is uniformly chosen at random from the set {4 hr, 5 hr, 6 hr, 7 hr}.

TABLE I
OPERATING SPECIFICATIONS OF CONTROLLABLE APPLIANCES.
Appliance          ($p_{a,i}^{\mathrm{avg}}$, $d_{a,i}$, $d_{a,i}^{\min}$, $d_{a,i}^{\max}$)
Dish washer        (1.5 kW, 2 hr, -, -)
Washing machine    (2.5 kW, 3 hr, -, -)
Stove              (3 kW, 3 hr, -, -)
EV                 (3 kW, -, $(B_i^{d} - B_i^{0})/p_{a,i}^{\mathrm{avg}}$, $(B_i^{\max} - B_i^{0})/p_{a,i}^{\mathrm{avg}}$)
Air conditioner    (1.5 kW, -, 2 hr, 8 hr)
Water heater       (2.5 kW, -, 0 hr, 5 hr)

Fig. 2. Price parameters over one day: (a) $l_t^{\mathrm{th}}$; (b) $\lambda_{1,t}$ and $\lambda_{2,t}$.

For the washing machine, we model $(P_{a,i}(\Delta),\ \Delta \ge 1)$ as a truncated normal distribution which is lower bounded by zero, and has a mean value of 288 time slots and a standard deviation of 60 time slots. For other appliances, we use a truncated normal distribution with a mean value of 96 time slots and a standard deviation of 20 time slots. In practical implementations, the probability distribution $(P_{a,i}(\Delta),\ \Delta \ge 1)$ for each appliance $a$ can be approximated by using the historical record on the usage behaviour of each user. Unless stated otherwise, the price parameters vary periodically with a period of one day.
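To make the wake-up model concrete, the sketch below discretizes a truncated normal distribution over the gap between two consecutive wake-up times and evaluates the conditional wake-up probability using the expression derived in Appendix A, namely $P_{a,i}(\delta_{a,i,t})$ divided by $1 - \sum_{\Delta=1}^{\delta_{a,i,t}-1} P_{a,i}(\Delta)$. The function names are hypothetical, and the upper cut-off `max_gap` is an assumption introduced only to obtain a finite table; the paper specifies only the lower bound, the mean, and the standard deviation.

```python
import math

def truncated_normal_pmf(mean, std, max_gap):
    """Discretized normal weights on gaps 1..max_gap, renormalized to sum to 1.

    Index g - 1 of the returned list holds P(gap = g). The upper cut-off
    max_gap is an assumption made to keep the table finite.
    """
    weights = [math.exp(-0.5 * ((g - mean) / std) ** 2)
               for g in range(1, max_gap + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def wake_up_probability(pmf, delta):
    """Probability of waking in the next slot, given delta slots since the
    most recent wake-up: P(delta) / (1 - sum_{g < delta} P(g))."""
    survived = 1.0 - sum(pmf[: delta - 1])
    if survived <= 0.0:
        return 1.0  # all probability mass lies on shorter gaps
    return pmf[delta - 1] / survived

# Mean of 96 slots and standard deviation of 20 slots, as used for most
# appliances in the simulations; max_gap = 200 is an assumed cut-off.
pmf = truncated_normal_pmf(mean=96.0, std=20.0, max_gap=200)
p_early = wake_up_probability(pmf, 50)
p_late = wake_up_probability(pmf, 120)
```

The conditional probability grows as the time since the last wake-up passes the mean gap, which is why `p_late` exceeds `p_early` here.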
As discussed in Section II-B, the periodic price parameter vector is a special case for the hidden Markov model in Assumption 1. Figs. 2 (a) and (b) show $l_t^{\mathrm{th}}$, $t \in \mathcal{T}$, and $\lambda_{1,t}$ and $\lambda_{2,t}$, $t \in \mathcal{T}$, over one day, respectively. The actor and critic step sizes in iteration $k$ of the LSL algorithm are set to $\gamma_a^{k} = m_a / k^{2/3}$ and $\gamma_c^{k} = m_c / k$, respectively. Since each ECC may use different values for $m_a$ and $m_c$ in practice, we choose $m_a$ and $m_c$ uniformly from [0.5, 2] for each household. Unless stated otherwise, the discount factor $\beta$ is set to 0.995, i.e., the users are foresighted.

For the benchmark scenario without demand response, the non-interruptible appliances are operated as soon as they become awake. The air conditioner and water heater are operated according to their desired load patterns. The EV starts to charge when it is plugged in. We simulate both the benchmark case and the LSL algorithm for several scenarios using Matlab on a PC with an Intel Core [...] U CPU at 1.80 GHz.

Fig. 3. (a) Load demand for household 1 over two days; (b) aggregate load demand of users over one day; (c) aggregate load demands of all users over seven days with and without load scheduling.

First, we compare the load profiles for household 1 over two days in the benchmark scenario (without load scheduling) and the LSL algorithm (with load scheduling) in Fig. 3 (a). The EV charging demands of household 1 in the first and second days are 6 and 8 hours, respectively. With the LSL algorithm, the ECC of household 1 schedules the operating appliances to reduce the payment. In particular, since the peak load with scheduling in the first day is much lower than that in the second day, the ECC of the foresighted household 1 charges the EV for 8.5 hours in the first day (larger than the demand of 6 hours in the first day), in order to reduce the charging time to 5.5 hours in the second day. Such a charging schedule reduces the peak load in the second day.

Fig. 3 (b) shows the aggregate load demand of all users during one sample day. The peak load is about 1.9 MW around 8 pm without load scheduling. When the households deploy the LSL algorithm, the ECCs schedule the controllable appliances to off-peak hours at the MPE. The peak load decreases by 27% to 1.4 MW. Fig. 3 (c) shows the aggregate load profile of all users over one week. The peak load reduction can be observed in all days with the LSL algorithm.

Fig. 4. Daily average cost for myopic and foresighted household 1.

The LSL algorithm benefits the users by reducing their daily average cost.
We perform simulations for $\beta$ = 0.995, 0.8, 0.5, 0.2, 0.05, which includes the extreme cases of foresighted users ($\beta$ = 0.995) and myopic users ($\beta$ = 0.05). We present the daily average cost of household 1 for different values of $\beta$ in Fig. 4. The initial value of $4.8 per day is the daily average cost without load scheduling. When household 1 is foresighted, its daily average cost decreases by 28% (from $4.8 per day to $3.5 per day). When $\beta$ decreases, the daily average cost increases gradually. For a myopic user, the daily average cost decreases by 11% (from $4.8 per day to $4.3 per day). The reason is that the ECC for foresighted users schedules the appliances considering the price in the current and future time slots.

Fig. 5 (a) shows the charging profile of the EV for household 1 with a myopic user. Fig. 5 (b) shows the dynamics of the electricity price over two days when the users are myopic. The ECC of the myopic user (with $\beta$ = 0.05) considers the daily price fluctuations and charges the EV just to fulfill the charging demand (for 6 hours). Fig. 5 (c) shows the charging profile of the EV for household 1 with a foresighted user. Fig. 5 (d) shows the dynamics of the electricity price when the users are foresighted. The ECC of the foresighted user (with $\beta$ = 0.995), on the other hand, takes advantage of the price fluctuations over multiple days (in particular the low current price) and charges the EV more than the current charging demand (for 8.5 hours) in order to reduce the cost in the following day when the price in the charging period is high.

The LSL algorithm helps the utility company reduce the PAR in the aggregate load demand. We compute the expected PAR over a period of 2 months in Fig. 6. We consider two special cases of the hidden Markov model in Assumption 1, i.e., the periodic and random price parameters, respectively, to evaluate the performance of the LSL algorithm. With periodic price parameters, the LSL algorithm performs well and reduces the PAR from 2.3 to 2.02 (13% reduction) in 3000 time slots (about a month).
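For readers unfamiliar with the metric, the PAR of a load profile is its peak value divided by its average value. The short sketch below computes it for two toy profiles with the same total energy; the function name and the load values are hypothetical and are not taken from the simulations.

```python
def peak_to_average_ratio(load):
    """Peak-to-average ratio of a load profile given as a list of per-slot loads."""
    average = sum(load) / len(load)
    return max(load) / average

# Toy profiles (assumed values, in MW) with the same total energy.
unscheduled = [1.0, 1.0, 2.0, 4.0]
scheduled = [1.5, 1.5, 2.0, 3.0]
par_before = peak_to_average_ratio(unscheduled)
par_after = peak_to_average_ratio(scheduled)
```

Shifting load away from the peak slot leaves the average unchanged and lowers the peak, so the PAR drops from 2.0 to 1.5 in this toy case.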
For random price parameters, we assume that the utility company chooses $l_t^{\mathrm{th}}$, $t \in \mathcal{T}$, from a truncated normal distribution with a mean value shown in Fig. 2 (a) and a standard deviation of 0.2 MW. The parameters $\lambda_{1,t}$ and $\lambda_{2,t}$, $t \in \mathcal{T}$, are also chosen from a truncated normal distribution with a mean value shown in Fig. 2 (b) and a standard deviation of 5 $/MW. The random price parameters can model abnormal fluctuations (such as spikes in the price values). In practice, the probability distributions for the price parameters can be estimated from the historical price data. Nevertheless, our LSL algorithm is model-free, hence the ECCs do not need to know the probability distributions of the price parameters. Results show that the ECCs can still effectively determine their MPE policies through learning, but it takes 6500 time slots (about two months) for the PAR to converge to [...]. Thus, the LSL algorithm has a robust performance even in a market with random fluctuations in the price parameters.

Fig. 5. (a) The EV's charging schedule when household 1 is myopic ($\beta$ = 0.05); (b) the electricity price when users are myopic; (c) the EV's charging schedule when household 1 is foresighted ($\beta$ = 0.995); (d) the electricity price when users are foresighted.

Fig. 6. Expected PAR of the LSL algorithm with periodic and random price parameters.

We show that Algorithm 2 converges to the MPE by using the MPE characterization in Theorem 2. Fig. 7 depicts the absolute values of the approximate objective function $f_i^{\mathrm{obj}}(V_i^{\pi,k}, \pi_i^{k})$ for households 1, 2, and 3. It shows that the objective values converge to zero (we have the same result for other households), which is the global optimal solution of problem (23). Thus, from Theorem 2, the LSL algorithm converges to the MPE of Game 2. Though the action and state spaces of each household are large, the speed of convergence is acceptable as a result of using the value function and policy approximations. The jumps in the curves in Fig. 7 correspond to the iterations where the basis functions in (31) and (32) are added to the basis function sets. In our simulation, the running time of the LSL algorithm per iteration per household is only a few seconds. As the households only need to go through one iteration of computation per time slot (e.g., 15 minutes), the proposed algorithm is suitable for real-time execution.

We compare the LSL algorithm with a scheduling algorithm based on Q-learning to demonstrate the benefit of the actor-critic method. Q-learning has been used in some existing learning algorithms for demand response (e.g., [7] and [8]). We consider an algorithm based on Q-learning with the same structure as the LSL algorithm, with the only difference that the ECC updates the Q-functions [22, Ch. 6]. Fig. 8 shows the daily average cost of household 1 using the LSL algorithm and the Q-learning benchmark. In each iteration of the Q-learning benchmark, the policies are obtained from the updated values of the Q-functions (which are computed based on the Boltzmann exploration as in [7]). The policy update suffers from high fluctuations and slow learning. Our proposed algorithm converges much more smoothly, with a total convergence time around 25% of that of the Q-learning benchmark.

To study how the observation profile approximation in Algorithm 1 affects the users' policy, we compare the households' policies in two scenarios. In the first scenario, the states are partially observable to the ECCs. They will use Algorithm 1 to approximate the observation profile of all households. In the second scenario, the utility company shares the state of all households with each ECC. Thus, the states become fully observable to the ECCs. The LSL algorithm can be used in both scenarios to determine the MPE policy of the households.

Fig. 7. Objective value $f_i^{\mathrm{obj}}(V_i^{\pi,k}, \pi_i^{k})$ for households 1, 2, and 3.

Fig. 8. Daily average cost for household 1 with the algorithm based on Q-learning and our proposed LSL algorithm.

Fig. 9. The aggregate load demand with the partially observable load scheduling and fully observable load scheduling.

Fig. 9 shows the aggregate load demand in both scenarios over one day, with and without load scheduling. When the states are partially observable, the ECCs play a sequence of Bayesian games in Game 2. As each ECC has incomplete information about other households' states, it determines an optimal policy that minimizes the expected cost in all possible states of other households under a given approximate observation profile. When the states are fully observable, the ECCs play a sequence of normal form games. As each ECC knows the actual state of other households, its policy becomes the best response for the actual state of the system. Fig. 9 shows that when the states become fully observable, the peak in the aggregate load demand further decreases when the aggregate load is around the threshold $l_t^{\mathrm{th}}$. This reduces the expected cost of the households, e.g., the daily average cost of household 1 is reduced by 6.3% (from $3.5 per day to $3.28 per day).

VI. CONCLUSION

In this paper, we formulated the scheduling problem of the controllable appliances in the residential households as a partially observable stochastic game, where each household aims at minimizing its discounted average cost in a real-time pricing market. We proposed a distributed and model-free learning algorithm based on the actor-critic method to determine the MPE policy. We used the value function and policy approximation technique to reduce the action and state spaces of the households and improve the learning speed. Simulation results show that the expected PAR in the aggregate load can be reduced by 13% when users deploy the proposed algorithm. Furthermore, the foresighted users can benefit from a 28% reduction in their expected discounted cost in the long term, which is 17% lower than the expected cost of the myopic users. For future work, we plan to extend our LSL algorithm to a deregulated market, where multiple households participate in the demand response program and can choose to purchase electricity from multiple utility companies.

APPENDIX

A. The Proof of Equation (1)

Consider appliance $a \in \mathcal{A}_i$ in household $i$. According to Definition 1, $\delta_{a,i,t}$ for $t \in \mathcal{T}$ is the number of time slots since the most recent time slot that appliance $a$ becomes awake with the most recent new task. In other words, appliance $a$ has not become awake with a new task again in time slots $t - \delta_{a,i,t} + 1, \ldots, t$ since it became awake in time slot $t - \delta_{a,i,t} + 1$. The value of $P_{a,i}(\delta_{a,i,t})$ is the probability that the difference between two sequential wake-up times for appliance $a$ is $\delta_{a,i,t}$. Given the current time slot $t$, the probability $P_{a,i,t+1}$ that appliance $a \in \mathcal{A}_i$ becomes awake with a new task in the next time slot $t + 1 \in \mathcal{T}$ can be obtained from the Bayes rule as

$$P_{a,i,t+1} = \frac{\mathrm{Prob}\{E_1 \mid E_2\}\, \mathrm{Prob}\{E_2\}}{\mathrm{Prob}\{E_1\}}, \tag{34}$$

where $E_1$ is the event that appliance $a$ has not become awake with a new task until time slot $t$, and $E_2$ is the event that appliance $a$ becomes awake in time slot $t + 1$ after $\delta_{a,i,t}$ time slots since it became awake with the most recent task. With probability $\mathrm{Prob}\{E_1 \mid E_2\} = 1$, appliance $a$ has not become awake with a new task until time slot $t$ conditioned on the event that it becomes awake with a new task in time slot $t + 1$. With probability $\mathrm{Prob}\{E_2\} = P_{a,i}(\delta_{a,i,t})$, appliance $a$ becomes awake in time slot $t + 1$ after $\delta_{a,i,t}$ time slots since it became awake with the most recent task. Appliance $a$ has not become awake in time slots $t - \delta_{a,i,t} + 1, \ldots, t$ with probability $\mathrm{Prob}\{E_1\} = 1 - \sum_{\Delta=1}^{\delta_{a,i,t}-1} P_{a,i}(\Delta)$. Therefore, $P_{a,i,t+1}$ can be obtained as (1). This completes the proof.

B. The Proof of Theorem 1

The MPE policy in Game 2 is the fixed point solution of every household's best response policy.
Household $i$ solves the Bellman equations (20) for all approximate observation profiles $\hat{o} \in \mathcal{O}$ when the other households' policies are fixed. We construct a Bayesian game from the underlying fully observable game with incomplete information as follows:

Game 3 Bayesian Game Among Virtual Households:
Players: The set of virtual households, where each virtual household $(i, \hat{o})$ corresponds to each real household $i \in \mathcal{N}$ and observation profile $\hat{o} \in \mathcal{O}$.
Types: The type of each virtual household $(i, \hat{o})$ is the observation $o_i \in \mathcal{O}_i$ of household $i$. $P_i(o_i \mid \hat{o})$ is the probability that virtual household $(i, \hat{o})$ has type $o_i$.
Strategies: The strategy for virtual household $(i, \hat{o})$ is the probability distribution $\pi_i(\hat{o})$ over the actions $x_i(\hat{o}) \in \hat{\mathcal{X}}_i(\hat{o})$.
Costs: The cost of each virtual household $(i, \hat{o})$ with strategy $\pi_i(\hat{o})$ is equal to $\mathbb{E}_{\pi_i(\hat{o})}\left\{ Q_i^{\pi}(\hat{o}, x_i(\hat{o})) \right\}$, where $Q_i^{\pi}(\hat{o}, x_i(\hat{o}))$ is defined in (18).

We consider the Bayesian Nash equilibrium (BNE) solution concept for the underlying Bayesian game among virtual households. We show that the BNE corresponds to the MPE of Game 2 among households $i \in \mathcal{N}$. In Game 3, each virtual household $(i, \hat{o})$ aims to determine its BNE strategy $\pi_i^{\mathrm{BNE}}(\hat{o})$ to minimize $\mathbb{E}_{\pi_i^{\mathrm{BNE}}(\hat{o})}\left\{ Q_i^{\pi^{\mathrm{BNE}}}(\hat{o}, x_i(\hat{o})) \right\}$ when the other virtual households' strategies are fixed. Therefore, in the BNE all virtual households solve the Bellman equations in (20). Consequently, the BNE of Game 3 among virtual households corresponds to the MPE of Game 2 among real households. A BNE always exists for Bayesian games with a finite number of players and actions [17, Ch. 6]. Thus, an MPE exists for the fully observable game with incomplete information among households. This completes the proof.

C. The Proof of Theorem 2

We use an approach similar to [9, Theorem 3.8.2] to show that the joint policy $\pi$ is an MPE if and only if $f_i^{\mathrm{obj}}(V_i^{\pi}, \pi_i) = 0$ for all households $i \in \mathcal{N}$. Then, we obtain the condition in (24) for the policy in an MPE. Our proof involves two steps.

Step (a) Consider the joint policy $\pi$ and value functions $V_i^{\pi}(\hat{o})$, $i \in \mathcal{N}$, in the feasible set of problem (23), for which we have $f_i^{\mathrm{obj}}(V_i^{\pi}, \pi_i) = 0$ for $i \in \mathcal{N}$. We show that the policy $\pi$ is an MPE. According to the constraint set of problem (23), the Bellman errors for the actions in an approximate observation profile $\hat{o}$ are nonnegative. Since $f_i^{\mathrm{obj}}(V_i^{\pi}, \pi_i)$ is the expectation over the Bellman errors, its value is nonnegative for all feasible policies and value functions.
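If $f_i^{\mathrm{obj}}(V_i^{\pi}, \pi_i) = 0$ for all $i \in \mathcal{N}$, then the policy $\pi$ and the value functions $V_i^{\pi}(\hat{o})$, $i \in \mathcal{N}$, are the global optimum of problem (23) for all households. Hence, no household has the incentive to unilaterally change its policy in order to further reduce its objective value $f_i^{\mathrm{obj}}(V_i^{\pi}, \pi_i)$. In other words, the policy $\pi$ is an MPE.

Next, we show that for an MPE policy $\pi^{\mathrm{MPE}}$, we can determine a value function $V_i^{\pi^{\mathrm{MPE}}}(\hat{o})$ such that $f_i^{\mathrm{obj}}(V_i^{\pi^{\mathrm{MPE}}}, \pi_i^{\mathrm{MPE}}) = 0$. From (22), $f_i^{\mathrm{obj}}(V_i^{\pi^{\mathrm{MPE}}}, \pi_i^{\mathrm{MPE}}) = 0$ is equivalent to

$$\sum_{\hat{o} \in \mathcal{O}} \mathbb{E}_{\pi_i^{\mathrm{MPE}}(\hat{o})}\left\{ B_i\left(V_i^{\pi^{\mathrm{MPE}}}, \hat{o}, x_i(\hat{o})\right) \right\} = 0, \quad \forall\, i \in \mathcal{N}. \tag{35}$$

According to the constraint set of problem (23), the Bellman errors for the actions in an observation profile $\hat{o}$ are nonnegative in the MPE. Thus, each term of the summation in (35) should be zero. That is,

$$\mathbb{E}_{\pi_i^{\mathrm{MPE}}(\hat{o})}\left\{ B_i\left(V_i^{\pi^{\mathrm{MPE}}}, \hat{o}, x_i(\hat{o})\right) \right\} = 0, \quad \forall\, \hat{o} \in \mathcal{O},\ i \in \mathcal{N}, \tag{36}$$

which is equivalent to

$$\mathbb{E}_{\pi_i^{\mathrm{MPE}}(\hat{o})}\left\{ Q_i^{\pi^{\mathrm{MPE}}}(\hat{o}, x_i(\hat{o})) - V_i^{\pi^{\mathrm{MPE}}}(\hat{o}) \right\} = 0, \quad \forall\, \hat{o} \in \mathcal{O},\ i \in \mathcal{N}. \tag{37}$$

$\pi_i^{\mathrm{MPE}}(\hat{o})$ is a randomized policy. Hence, we have $\mathbb{E}_{\pi_i^{\mathrm{MPE}}(\hat{o})}\left\{ V_i^{\pi^{\mathrm{MPE}}}(\hat{o}) \right\} = V_i^{\pi^{\mathrm{MPE}}}(\hat{o})$. Hence, for all approximate observation profiles $\hat{o} \in \mathcal{O}$, (37) can be rewritten as

$$V_i^{\pi^{\mathrm{MPE}}}(\hat{o}) = \mathbb{E}_{\pi_i^{\mathrm{MPE}}(\hat{o})}\left\{ Q_i^{\pi^{\mathrm{MPE}}}(\hat{o}, x_i(\hat{o})) \right\}, \quad \forall\, i \in \mathcal{N}. \tag{38}$$

For household $i$, we define the average cost in approximate observation profile $\hat{o}$ as $\bar{c}_i(\hat{o}) = \mathbb{E}_{\pi^{\mathrm{MPE}}(\hat{o})}\{ c_i(\hat{o}, x(\hat{o})) \}$. We define the average transition probability from observation $\hat{o}$ to $\hat{o}'$ as $\bar{P}(\hat{o}' \mid \hat{o}) = \mathbb{E}_{\pi^{\mathrm{MPE}}(\hat{o})}\{ P(\hat{o}' \mid \hat{o}, x(\hat{o})) \}$. We define the vectors $\bar{c}_i = (\bar{c}_i(\hat{o}),\ \hat{o} \in \mathcal{O})$ and $V_i^{\pi^{\mathrm{MPE}}} = (V_i^{\pi^{\mathrm{MPE}}}(\hat{o}),\ \hat{o} \in \mathcal{O})$, and define the transition matrix $\bar{P} = [\bar{P}(\hat{o}' \mid \hat{o}),\ \hat{o}, \hat{o}' \in \mathcal{O}]$. By substituting (18) into (38), we have

$$V_i^{\pi^{\mathrm{MPE}}} = (1 - \beta)\, \bar{c}_i + \beta\, \bar{P}\, V_i^{\pi^{\mathrm{MPE}}}. \tag{39}$$

By rearranging the terms in (39), we obtain

$$\left( I - \beta \bar{P} \right) V_i^{\pi^{\mathrm{MPE}}} = (1 - \beta)\, \bar{c}_i, \tag{40}$$

where $I$ is the identity matrix. Matrix $\bar{P}$ is a stochastic matrix (i.e., each of its entries is a nonnegative real number representing a probability), and thus its eigenvalues have magnitude less than or equal to one. Besides, the discount factor $\beta$ is less than one. Hence, the eigenvalues of matrix $I - \beta \bar{P}$ are nonzero, and thereby it is invertible (or nonsingular). From (40), we can obtain $V_i^{\pi^{\mathrm{MPE}}}$ as

$$V_i^{\pi^{\mathrm{MPE}}} = (1 - \beta) \left( I - \beta \bar{P} \right)^{-1} \bar{c}_i. \tag{41}$$

Therefore, for the MPE policy $\pi^{\mathrm{MPE}}$, we obtain the value function $V_i^{\pi^{\mathrm{MPE}}}(\hat{o})$ in (41) such that $f_i^{\mathrm{obj}}(V_i^{\pi^{\mathrm{MPE}}}, \pi_i^{\mathrm{MPE}}) = 0$ for all households $i \in \mathcal{N}$.

Step (b) We obtain the condition in (24) for the policy in an MPE. For each household $i \in \mathcal{N}$, the objective function $f_i^{\mathrm{obj}}(V_i^{\pi^{\mathrm{MPE}}}, \pi_i^{\mathrm{MPE}})$ in (22) can be expressed as

$$f_i^{\mathrm{obj}}(V_i^{\pi^{\mathrm{MPE}}}, \pi_i^{\mathrm{MPE}}) = \sum_{\hat{o} \in \mathcal{O}} \sum_{x_i(\hat{o}) \in \hat{\mathcal{X}}_i(\hat{o})} \pi_i^{\mathrm{MPE}}(\hat{o}, x_i(\hat{o}))\, B_i\left(V_i^{\pi^{\mathrm{MPE}}}, \hat{o}, x_i(\hat{o})\right). \tag{42}$$

The Bellman error $B_i(V_i^{\pi^{\mathrm{MPE}}}, \hat{o}, x_i(\hat{o}))$ is nonnegative. Hence, from (42), $f_i^{\mathrm{obj}}(V_i^{\pi^{\mathrm{MPE}}}, \pi_i^{\mathrm{MPE}}) = 0$ is equivalent to $\pi_i^{\mathrm{MPE}}(\hat{o}, x_i(\hat{o}))\, B_i(V_i^{\pi^{\mathrm{MPE}}}, \hat{o}, x_i(\hat{o})) = 0$ for all households $i \in \mathcal{N}$ with action $x_i(\hat{o}) \in \hat{\mathcal{X}}_i(\hat{o})$ in observation profile $\hat{o}$. This completes the proof.

D. Bellman Error Approximation

The basis function in (32) is equal to the expectation over the Bellman errors for all feasible actions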

$x_i(\hat{o}^{k-1}) \in \mathcal{X}_i(o_i^{k-1})$. ECC $i$ knows the observation $o_i^{k-1}$, the approximate observation profile $\hat{o}^{k-1}$, and the cost $c_i(\hat{o}^{k-1}, x_i^{k-1}(\hat{o}^{k-1}), x_{-i}^{k-1}(\hat{o}^{k-1}))$ for the chosen action $x_i^{k-1}(\hat{o}^{k-1})$ in iteration $k-1$, as well as the current observation $o_i^{k}$ and the approximate observation profile $\hat{o}^{k}$. ECC $i$ needs to use this available information to approximate the Bellman error for an arbitrary feasible action $x_i(\hat{o}^{k-1}) \in \mathcal{X}_i(o_i^{k-1})$. We use the TD error as an estimate of the Bellman error [14, Lemma 3]. We have

$$B_i^{k-1}\left(V_i^{\pi,k-1}, \hat{o}^{k-1}, x_i(\hat{o}^{k-1})\right) \approx (1 - \beta)\, c_i\left(\hat{o}^{k-1}, x_i(\hat{o}^{k-1}), x_{-i}^{k-1}(\hat{o}^{k-1})\right) + \beta\, V_i^{\pi,k-1}\left(\hat{o}^{k}(x_i(\hat{o}^{k-1})), \theta_i^{k-1}\right) - V_i^{\pi,k-1}(\hat{o}^{k-1}, \theta_i^{k-1}), \tag{43}$$

where $\hat{o}^{k}(x_i(\hat{o}^{k-1}))$ is the approximate observation profile in the current time slot $k$ if household $i$ chooses action $x_i(\hat{o}^{k-1})$ in the previous time slot $k-1$. ECC $i$ determines $\hat{o}^{k}(x_i(\hat{o}^{k-1}))$ in the following two steps:

Step (a) ECC $i$ knows the observations $o_i^{k-1}$ and $o_i^{k}$. Thus, it can determine the set of appliances that become awake with a new task in the current time slot $k$. ECC $i$ can also determine the state of other operating appliances for an arbitrary feasible action $x_i(\hat{o}^{k-1})$. Therefore, ECC $i$ can determine the state of its own household for an arbitrary feasible action $x_i(\hat{o}^{k-1})$.

Step (b) The states of other households are fixed. Furthermore, ECC $i$ knows the approximate observation profile $\hat{o}^{k}$ for the chosen action $x_i^{k-1}(\hat{o}^{k-1})$. Using the result of Step (a), ECC $i$ can compute the average aggregate load demands for the feasible actions of all households for an arbitrary feasible action $x_i(\hat{o}^{k-1}) \in \mathcal{X}_i(o_i^{k-1})$, and thus it can determine the approximate observation profile $\hat{o}^{k}(x_i(\hat{o}^{k-1}))$ for all households for action $x_i(\hat{o}^{k-1})$.

In addition to computing the approximate observation profile $\hat{o}^{k}(x_i(\hat{o}^{k-1}))$, ECC $i$ needs to compute the cost $c_i(\hat{o}^{k-1}, x_i(\hat{o}^{k-1}), x_{-i}^{k-1}(\hat{o}^{k-1}))$ for the feasible action $x_i(\hat{o}^{k-1}) \in \mathcal{X}_i(o_i^{k-1})$. ECC $i$ knows the payment to the utility company for the chosen action $x_i^{k-1}(\hat{o}^{k-1})$.
Since the load demand of one household is much smaller than the aggregate load demand of all households, we can assume that the price value is unchanged when household $i$ unilaterally changes its load demand. Thus, ECC $i$ can estimate its payment for an arbitrary feasible action $x_i(\hat{o}^{k-1}) \in \mathcal{X}_i(o_i^{k-1})$. ECC $i$ can also determine the discomfort cost for action $x_i(\hat{o}^{k-1}) \in \mathcal{X}_i(o_i^{k-1})$. Therefore, it can compute the cost $c_i(\hat{o}^{k-1}, x_i(\hat{o}^{k-1}), x_{-i}^{k-1}(\hat{o}^{k-1}))$ for an arbitrary feasible action $x_i(\hat{o}^{k-1})$. Finally, ECC $i$ is able to compute the approximate Bellman error in (43).

REFERENCES

[1] Office of Electricity Delivery & Energy Reliability, "Customer participation in the smart grid: Lessons learned," U.S. Department of Energy, Tech. Rep., Sept.
[2] The Brattle Group, Freeman, Sullivan & Co., and Global Energy Partners, LLC, "A national assessment of demand response potential," Federal Energy Regulatory Commission, Tech. Rep., Jun.
[3] P. Samadi, A. Mohsenian-Rad, V.W.S. Wong, and R. Schober, "Real-time pricing for demand response based on stochastic approximation," IEEE Trans. on Smart Grid, vol. 5, no. 2, Mar.
[4] Z. Chen, L. Wu, and Y. Fu, "Real-time price-based demand response management for residential appliances via stochastic optimization and robust optimization," IEEE Trans. on Smart Grid, vol. 3, no. 4, Dec.
[5] C. Eksin, H. Delic, and A. Ribeiro, "Demand response management in smart grids with heterogeneous consumer preferences," IEEE Trans. on Smart Grid, vol. 6, no. 6, Nov.
[6] N. Forouzandehmehr, M. Esmalifalak, A. Mohsenian-Rad, and Z. Han, "Autonomous demand response using stochastic differential games," IEEE Trans. on Smart Grid, vol. 6, no. 1, Jan.
[7] Z. Wen, D. O'Neill, and H. Maei, "Optimal demand response using device-based reinforcement learning," IEEE Trans. on Smart Grid, vol. 6, no. 5, Sept.
[8] B. Kim, Y. Zhang, M. van der Schaar, and J. Lee, "Dynamic pricing and energy consumption scheduling with reinforcement learning," IEEE Trans. on Smart Grid, vol. 7, no. 5, Sept.
[9] Y. Liang, L. He, X. Cao, and Z. J. Shen, "Stochastic control for smart grid users with flexible demand," IEEE Trans. on Smart Grid, vol. 4, no. 4, Dec.
[10] F. Ruelens, B. J. Claessens, S. Vandael, B. D. Schutter, R. Babuska, and R. Belmans, "Residential demand response of thermostatically controlled loads using batch reinforcement learning," accepted for publication in IEEE Trans. on Smart Grid.
[11] Y. Xiao and M. van der Schaar, "Distributed demand side management among foresighted decision makers in power networks," in Proc. of IEEE Conf. on Signals, Systems and Computers, Pacific Grove, CA, Nov.
[12] J. Yao and P. Venkitasubramaniam, "Optimal end user energy storage sharing in demand response," in Proc. of IEEE SmartGridComm, Miami, FL, Nov.
[13] L. Jia, Q. Zhao, and L. Tong, "Retail pricing for stochastic demand with unknown parameters: An online machine learning approach," in Proc. of Allerton Conf. on Communication, Control, and Computing, Monticello, IL, Oct.
[14] J. Filar and K. Vrieze, Competitive Markov Decision Processes. NY: Springer.
[15] E. Kalai and E. Lehrer, "Rational learning leads to Nash equilibrium," Econometrica, vol. 39, no. 10, Jul.
[16] A. Sandroni, "Does rational learning lead to Nash equilibrium in finitely repeated games?" Journal of Economic Theory, vol. 78, no. 1.
[17] M. Bowling and M. Veloso, "Rational and convergent learning in stochastic games," in Proc. of Int'l Conf. on Artificial Intelligence, Seattle, WA, Aug.
[18] L. M. Dermed and C. L. Isbell, "Solving stochastic games," in Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, Eds. Curran Associates, Inc., 2009. [Online]. Available:
[19] L. Li and J. Shamma, "LP formulation of asymmetric zero-sum stochastic games," in Proc. of IEEE Annual Conf. on Decision and Control, Los Angeles, CA, Dec.
[20] R. N. Borkovsky, U. Doraszelski, and Y. Kryukov, "A user's guide to solving dynamic stochastic games using the homotopy method," Operations Research, vol. 58, no. 4-part-2, Jul.
[21] M. Neely, "A Lyapunov optimization approach to repeated stochastic games," in Proc. of Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, Oct.
[22] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
[23] R. Emery-Montemerlo, G. Gordon, J. Schneider, and S. Thrun, "Approximate solutions for partially observable stochastic games with common payoffs," in Proc. of Int'l Conf. on Autonomous Agents and Multiagent Systems, New York, NY, Jul.
[24] F. Oliehoek, S. Whiteson, and M. Spaan, "Approximate solutions for factored Dec-POMDPs with many agents," in Proc. of Int'l Conf. on Autonomous Agents and Multiagent Systems, Saint Paul, MN, May.
[25] L. MacDermed, C. Isbell, and L. Weiss, "Markov games of incomplete information for multi-agent reinforcement learning," in Proc. of Int'l Conf. on Artificial Intelligence, San Francisco, CA, Aug.
[26] S. Bahrami and V.W.S. Wong, "An autonomous demand response program in smart grid with foresighted users," in Proc. of IEEE SmartGridComm, Miami, FL, Nov.
[27] I. H. Witten, "An adaptive optimal controller for discrete-time Markov environments," Information and Control, vol. 34, no. 4, Aug.


[28] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Trans. on Systems, Man, and Cybernetics, vol. 13, no. 5, Sept.
[29] V. Konda and J. Tsitsiklis, "On actor-critic algorithms," SIAM Journal on Control and Optimization, vol. 42, no. 4, Aug.
[30] S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee, "Natural actor-critic algorithms," Automatica, vol. 45, no. 11, pp. 2471-2482, Nov. 2009.
[31] A. Mohsenian-Rad and A. Leon-Garcia, "Optimal residential load control with price prediction in real-time electricity pricing environments," IEEE Trans. on Smart Grid, vol. 1, no. 2, Sept.
[32] D. W. Bunn, Modelling Prices in Competitive Electricity Markets. Toronto, Canada: Wiley Finance, 2004.
[33] R. S. Mamon and R. J. Elliott, Hidden Markov Models in Finance. NY: Springer, 2014.
[34] P. Yang, G. Tang, and A. Nehorai, "A game-theoretic approach for optimal time-of-use electricity pricing," IEEE Trans. on Power Systems, vol. 28, no. 2, Aug.
[35] Y. Shoham and K. Leyton-Brown, Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press.
[36] L. Busoniu, R. Babuska, B. D. Schutter, and D. Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators. FL: CRC Press.
[37] R. Parr, C. Painter-Wakefield, L. Li, and M. L. Littman, "Analyzing feature generation for value-function approximation," in Proc. of Int'l Conf. on Machine Learning, New York, NY, Jun.
[38] Toronto Hydro. [Online]. Available: /electricsystem/residential/yourbilloverview/pages/appliancechart.aspx

Shahab Bahrami (S'12) received the B.Sc. and M.A.Sc. degrees both from Sharif University of Technology, Tehran, Iran, in 2010 and 2012, respectively. He is currently a Ph.D. candidate in the Department of Electrical and Computer Engineering, The University of British Columbia (UBC), Vancouver, BC, Canada. His research interests include optimal power flow analysis, game theory, and demand side management, with applications in smart grid.

Vincent W.S. Wong (S'94, M'00, SM'07, F'16) received the B.Sc. degree from the University of Manitoba, Winnipeg, MB, Canada, in 1994, the M.A.Sc. degree from the University of Waterloo, Waterloo, ON, Canada, in 1996, and the Ph.D. degree from the University of British Columbia (UBC), Vancouver, BC, Canada. From 2000 to 2001, he worked as a systems engineer at PMC-Sierra Inc. (now Microsemi). He joined the Department of Electrical and Computer Engineering at UBC in 2002 and is currently a Professor. His research areas include protocol design, optimization, and resource management of communication networks, with applications to wireless networks, smart grid, and the Internet. Dr. Wong is an Editor of the IEEE Transactions on Communications. He was a Guest Editor of IEEE Journal on Selected Areas in Communications and IEEE Wireless Communications. He has served on the editorial boards of IEEE Transactions on Vehicular Technology and Journal of Communications and Networks. He has served as a Technical Program Co-chair of IEEE SmartGridComm'14, as well as a Symposium Co-chair of IEEE SmartGridComm'13 and IEEE Globecom'13. Dr. Wong is the Chair of the IEEE Communications Society Emerging Technical Sub-Committee on Smart Grid Communications and the IEEE Vancouver Joint Communications Chapter. He received the 2014 UBC Killam Faculty Research Fellowship.

Jianwei Huang (S'01-M'06-SM'11-F'16) is an IEEE Fellow, a Distinguished Lecturer of IEEE Communications Society, and a Thomson Reuters Highly Cited Researcher in Computer Science. He is an Associate Professor and Director of the Network Communications and Economics Lab (ncel.ie.cuhk.edu.hk), in the Department of Information Engineering at the Chinese University of Hong Kong. He received the Ph.D. degree from Northwestern University in 2005, and worked as a Postdoc Research Associate at Princeton University. He is the co-recipient of 8 Best Paper Awards, including IEEE Marconi Prize Paper Award in Wireless Communications. He has co-authored six books, including the textbook on Wireless Network Pricing. He received the CUHK Young Researcher Award in 2014 and IEEE ComSoc Asia-Pacific Outstanding Young Researcher Award. He has served as an Associate Editor of IEEE/ACM Transactions on Networking, IEEE Transactions on Wireless Communications, IEEE Journal on Selected Areas in Communications - Cognitive Radio Series, and IEEE Transactions on Cognitive Communications and Networking. He has served as the Chair of IEEE ComSoc Cognitive Network Technical Committee and Multimedia Communications Technical Committee.
