
UNIVERSITY OF NAIROBI
FACULTY OF ENGINEERING
DEPARTMENT OF ELECTRICAL AND INFORMATION ENGINEERING

COMBINED REAL AND REACTIVE DISPATCH OF POWER USING REINFORCEMENT LEARNING

PROJECT INDEX: 045
SUBMITTED BY: NYABUGA JOSHUA TUONI, F17/1372/2010
SUPERVISOR: MR. PETER MUSAU
EXAMINER: MR. OGABA

PROJECT REPORT SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE AWARD OF THE DEGREE OF BACHELOR OF SCIENCE IN ELECTRICAL AND ELECTRONICS ENGINEERING OF THE UNIVERSITY OF NAIROBI

SUBMITTED ON: 24TH APRIL, 2015

DECLARATION OF ORIGINALITY

NAME OF STUDENT: NYABUGA JOSHUA TUONI
REGISTRATION NUMBER: F17/1372/2010
COLLEGE: Architecture and Engineering
FACULTY/SCHOOL/INSTITUTE: Engineering
DEPARTMENT: Electrical and Information Engineering
COURSE NAME: Bachelor of Science in Electrical and Electronic Engineering
TITLE OF WORK: COMBINED REAL AND REACTIVE DISPATCH OF POWER USING REINFORCEMENT LEARNING

1. I understand what plagiarism is and I am aware of the university policy in this regard.
2. I declare that this final year project report is my original work and has not been submitted elsewhere for examination, award of a degree or publication. Where other people's work or my own work has been used, this has properly been acknowledged and referenced in accordance with the University of Nairobi's requirements.
3. I have not sought or used the services of any professional agencies to produce this work.
4. I have not allowed, and shall not allow, anyone to copy my work with the intention of passing it off as his/her own work.
5. I understand that any false claim in respect of this work shall result in disciplinary action, in accordance with the University anti-plagiarism policy.

Signature: ................ Date: ................

CERTIFICATION

This report has been submitted to the Department of Electrical and Information Engineering of the University of Nairobi with my approval as supervisor:

Mr. Peter Musau
Signature: ................ Date: ................

DEDICATION

To my family, for continued support and prayers.

ACKNOWLEDGEMENT

I would like to thank God for having taken good care of me throughout my academic life and for the good health he has granted me. I extend my gratitude to my supervisor, Mr. Musau, for the support, guidance, useful criticism and encouragement he gave me as I did my project. I appreciate all my lecturers and non-teaching staff in the Department of Electrical and Information Engineering of the University of Nairobi for their contribution towards my degree. I also thank my classmates for the moral support they gave me as I undertook my project. Lastly, I thank my family for the support and understanding they have accorded me throughout my academic life.

TABLE OF CONTENTS

COMBINED REAL AND REACTIVE DISPATCH OF POWER USING REINFORCEMENT LEARNING
DECLARATION OF ORIGINALITY
CERTIFICATION
DEDICATION
ACKNOWLEDGEMENT
List of Figures
List of Tables
List of Abbreviations
ABSTRACT
1 INTRODUCTION
  1.1 Combined Real and Reactive Dispatch of Power
    1.1.1 What is Economic Dispatch?
  1.2 Survey of Earlier Work
    1.2.1 Genetic Algorithm (GA)
    1.2.2 Particle Swarm Optimization (PSO)
    1.2.3 Tabu Search (TS)
    1.2.4 Simulated Annealing (SA)
    1.2.5 Ant Colony Optimization (ACO)
    1.2.6 Neural Networks
    1.2.7 Hybrid Methods
  1.3 Problem Statement
  1.4 Justification
  1.5 Organization of the Report
2 LITERATURE REVIEW
  2.1 Literature Review on Real Power Economic Dispatch
    2.1.1 Real Dispatch of Power Objective Function
  2.2 Literature Review on Reactive Power Economic Dispatch
    2.2.1 Minimize Var Cost
    2.2.2 Minimum Deviation From a Specific Point
    2.2.3 Voltage Stability Related Objectives
    2.2.4 Multi-Objective (MO)
    2.2.5 Reactive Power Dispatch and Voltage Control
    2.2.6 Reactive Dispatch of Power Objective Function
  2.3 Literature Review on Reinforcement Learning
    2.3.1 Background Information on Reinforcement Learning
    2.3.2 N-Arm Bandit Problem
    2.3.3 Parts of Reinforcement Learning
    2.3.4 Multi-stage Decision Problem (MDP)
    2.3.5 Methods for Solving Multi-stage Decision Problems (MDP)
    2.3.6 Reinforcement Learning Approach for Solution
    2.3.7 Action Selection
3 SOLUTION TO COMBINED REAL AND REACTIVE DISPATCH OF POWER USING REINFORCEMENT LEARNING (RL)
  3.1 Formulation of the Real and Reactive Dispatch of Power Problem Using Reinforcement Learning
  3.2 Combined Active/Real and Reactive Power Cost
  3.3 RL Algorithm for Combined Real and Reactive Economic Dispatch Using the ε-Greedy Strategy
    3.3.1 Learning Phase
    3.3.2 Policy Retrieval Phase
  3.4 Flowchart of RL Algorithm for Combined Real and Reactive Dispatch of Power
4 RESULTS AND ANALYSIS
  4.1 Case Study: IEEE 14-Bus System
  4.2 Results
  4.3 Analysis and Discussion
5 CONCLUSION AND RECOMMENDATIONS FOR FURTHER WORK
  5.1 Conclusion
  5.2 Recommendations for Further Work
REFERENCES
APPENDIX: Matlab Code

LIST OF FIGURES

Figure 2-1: Voltage Stability Curve
Figure 2-2: Grid World Problem
Figure 3-1: Flowchart of RL Algorithm for ED
Figure 4-1: One Line Diagram of IEEE 14-Bus System [14]
Figure 4-2: Fuel Cost against Power Demand
Figure 4-3: Power Losses against Power Demand

LIST OF TABLES

Table 4-1: RL Parameters
Table 4-2: Real and Reactive Power Scheduling for a 14-Bus System
Table A-0-1: IEEE 14-Bus System Generator Data
Table A-0-2: IEEE 14-Bus Network Load and Generator Data [14]
Table A-0-3: IEEE 14-Bus Network Line Data [14]

LIST OF ABBREVIATIONS

RL    Reinforcement Learning
ED    Economic Dispatch
OPF   Optimal Power Flow
MW    Megawatts
IEEE  Institute of Electrical and Electronics Engineers
GA    Genetic Algorithm
PSO   Particle Swarm Optimization
TS    Tabu Search
SA    Simulated Annealing
ACO   Ant Colony Optimization
MDP   Multi-stage Decision Problem

ABSTRACT

Most economic dispatch problems involve real power only. With the integration of renewable energy into the grid, reactive power dispatch can no longer be ignored. This project shows how reactive power dispatch and real power dispatch are combined, and proposes an effective algorithm that uses Reinforcement Learning (RL) for optimum generation dispatch to minimize the fuel cost. Various methods have been used to solve the Economic Dispatch (ED) problem. These include conventional methods such as linear programming, non-linear programming, mixed integer programming, interior point methods and quadratic programming. The non-conventional methods are Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Tabu Search (TS), Ant Colony Optimization, simulated annealing, neural networks and hybrid techniques. In this project, the Reinforcement Learning (RL) method has been used to develop an algorithm for economic dispatch. The developed algorithm has been tested on the IEEE 14-bus, five-generator network. The allocation schedule for the five generating units was found for the following sets of real and reactive power demands: 800 MW & 370 MVAR, 900 MW & 470 MVAR and 1000 MW & 570 MVAR. The optimal fuel costs for real power, reactive power and combined real and reactive power generation were also computed.

1 INTRODUCTION

1.1 COMBINED REAL AND REACTIVE DISPATCH OF POWER

1.1.1 What is Economic Dispatch?

The economic dispatch problem is defined as that which minimizes the total operating cost of the power system while meeting the load plus transmission losses within generator limits [1]. The economic load dispatch problem involves the solution of two different problems:

a) Unit commitment/pre-dispatch problem: it is required to select optimally, out of the available generating sources, those to operate so as to meet the expected load and provide a specified margin of operating reserve over a specified period of time [2]. The unit commitment problem involves scheduling unit start-up and shut-down in a way that minimizes cost without compromising system security [3].

b) On-line economic dispatch: it is required to distribute the load among the generating units actually paralleled with the system in such a manner as to minimize the total cost of supplying the minute-to-minute requirements of the system [2].

Spinning reserve: this is a safety margin of generation where more units than necessary are kept on line so that, should a unit unexpectedly fail, or the load rise unexpectedly, the system can meet the load requirement without interrupting service [3].

In a load flow study of the power system, for a particular load demand, the generation at all the generator buses is fixed except at one generator bus, known as the slack, reference or swing bus, where the generation is allowed to take values within certain limits. In the case of economic load dispatch, the generations are not fixed but are allowed to take values, again within certain limits, so as to meet a particular load demand with minimum fuel consumption. This means economic load dispatch is really the solution of a large number of load flow problems and choosing the one which is optimal in the sense that it needs minimum cost of generation [2].

1.2 SURVEY OF EARLIER WORK: OPTIMIZATION METHODS

The optimization techniques used include both conventional and non-conventional ones. Conventional optimization techniques include:

- Linear programming
- Non-linear programming
- Mixed integer programming
- Interior point methods
- Quadratic programming, etc.

The disadvantage of these methods is that they converge to a local optimum solution. Non-conventional optimization techniques include:

- Genetic Algorithm (GA)
- Particle Swarm Optimization (PSO)
- Tabu Search (TS)
- Ant Colony Optimization (ACO)
- Simulated Annealing (SA)
- Neural Networks
- Hybrid techniques

1.2.1 Genetic Algorithm (GA)

GA is part of Evolutionary Algorithms (EA), a family of population-based optimization processes; other members of the EA family are Evolutionary Strategy (ES) and Evolutionary Programming (EP). GA is an optimization technique inspired by the process of natural selection. It does not differentiate the cost function and the constraints, and has a probability of convergence to a global optimum of one. It utilizes the operators of selection, crossover and mutation, combining survival of the fittest among string structures with a structured, yet random, information exchange. In every generation, a new set of artificially developed strings is produced using elements of the fittest of the old; an occasional new element is experimented with for enhancement [4]. A starting population is built with random gene values and evolves through several generations in which selection, crossover and mutation are repeated until a satisfactory solution has been found or a maximum number of iterations has been reached [5]. The algorithm identifies the individuals with the optimizing fitness values, and those with lower fitness naturally get discarded from the population. However, GA cannot assure constant optimization response times, which limits its use in real-time applications. A minimal sketch of these operators follows.
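As a rough illustration, the following MATLAB sketch applies tournament selection, arithmetic crossover and mutation to a hypothetical one-variable cost function. It is a real-coded variant rather than the binary strings described above, and the population size, mutation rate and cost function are illustrative assumptions, not values used in this project.

% Minimal real-coded GA sketch for minimizing a hypothetical cost f(x) = (x - 3)^2
f = @(x) (x - 3).^2;            % toy cost function (assumed)
N = 20;                          % population size (assumed)
pm = 0.1;                        % mutation probability (assumed)
pop = 10*rand(N,1) - 5;          % random initial population in [-5, 5]
for gen = 1:100
    fit = -f(pop);                           % higher fitness = lower cost
    % Tournament selection: keep the better of two random individuals
    parents = zeros(N,1);
    for i = 1:N
        c = randi(N, 1, 2);
        if fit(c(1)) > fit(c(2)), parents(i) = pop(c(1));
        else, parents(i) = pop(c(2)); end
    end
    % Arithmetic crossover between consecutive parents
    child = zeros(N,1);
    for i = 1:2:N-1
        w = rand;
        child(i)   = w*parents(i) + (1-w)*parents(i+1);
        child(i+1) = w*parents(i+1) + (1-w)*parents(i);
    end
    % Mutation: random perturbation with probability pm
    m = rand(N,1) < pm;
    child(m) = child(m) + randn(sum(m),1);
    pop = child;
end
[~, best] = min(f(pop));
bestSolution = pop(best)         % should approach x = 3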

1.2.2 Particle Swarm Optimization (PSO)

This is an intelligent search technique inspired by social dynamics and the behavior emergent from socially organized populations known as swarms, e.g. flocks of birds or schools of fish. The individuals are referred to as particles. The particles change their positions by flying around in a multidimensional search space until a relatively unchanged position has been encountered, or until computational limitations are exceeded [6]. A swarm of potential solutions is a population of particles. A particle bases its search not only on its personal experience but also on the information given by its neighbors in the swarm. Each particle keeps track of its coordinates in the problem space, which are associated with the best solution fitness it has achieved so far; the fitness value is also stored. This value is called pbest. Another best value tracked by the particle swarm optimizer is the location, lbest, obtained thus far by any particle in the neighborhood of the particle. When a particle takes the whole population as its topological neighbors, the best value is a global best and is called gbest [7, 6]. A PSO system combines local search methods with global search methods. It has the problems of dependency on the initial point and parameters, difficulty in finding optimal design parameters, and the stochastic character of the final outputs [8]. The main advantages of PSO are easy implementation, a simple concept, robustness in controlling the parameters and less computational time compared to other optimization techniques [6]. A sketch of the core position and velocity updates is given below.
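The following minimal MATLAB sketch assumes the gbest topology and common textbook values for the inertia weight and acceleration constants; these are illustrative assumptions, not parameters from this project.

% Minimal PSO sketch for minimizing a hypothetical cost f(x) = (x - 3)^2
f = @(x) (x - 3).^2;
N = 15;                          % number of particles (assumed)
w = 0.7; c1 = 1.5; c2 = 1.5;     % inertia and acceleration constants (assumed)
x = 10*rand(N,1) - 5;            % particle positions
v = zeros(N,1);                  % particle velocities
pbest = x; pbestCost = f(x);     % personal bests
[gbestCost, g] = min(pbestCost); gbest = pbest(g);
for it = 1:100
    % velocity and position updates at the heart of PSO
    v = w*v + c1*rand(N,1).*(pbest - x) + c2*rand(N,1).*(gbest - x);
    x = x + v;
    improved = f(x) < pbestCost;             % update personal bests
    pbest(improved) = x(improved);
    pbestCost(improved) = f(x(improved));
    [c, g] = min(pbestCost);                 % update global best
    if c < gbestCost, gbestCost = c; gbest = pbest(g); end
end
gbest                                         % should approach x = 3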

1.2.3 Tabu Search (TS)

This is a kind of iterative search characterized by the use of a flexible memory. It is able to escape local minima and to search areas beyond a local minimum; therefore, it has the ability to find the global minimum of a multimodal search space. The process by which Tabu search overcomes the local optimality problem is based on an evaluation function that chooses the highest-evaluation solution at each iteration. This means moving to the best admissible solution in the neighborhood of the current solution in terms of the objective value and tabu restrictions. The evaluation function selects the move that produces the most improvement or the least deterioration in the objective function. A tabu list is employed to store the characteristics of accepted moves so that these characteristics can be used to classify certain moves as tabu (i.e. to be avoided) in later iterations. The tabu list determines which solutions may be reached by a move from the current solution. Since moves not leading to improvements are accepted in tabu search, it is possible to return to already visited solutions, which might cause cycling; the tabu list is used to overcome this problem. The forbidding strategy is used to control and update the tabu list to avoid previously visited paths, thus allowing exploration of new areas. An aspiration criterion is used to free a tabu solution if it is of sufficient quality, thus preventing cycling [7].

1.2.4 Simulated Annealing (SA)

Annealing is the physical process of heating up a solid and then cooling it down slowly until it crystallizes. At high temperatures, the atoms have high energies and more freedom to arrange themselves. As the temperature is reduced, the atomic energies decrease. A crystal with a regular structure is obtained at the state where the system has minimum energy. If the cooling is carried out very quickly, which is known as rapid quenching, widespread irregularities and defects are seen in the crystal structure; the system does not reach the minimum energy state and ends in a polycrystalline state which has a higher energy [7]. In the analogy between a combinatorial optimization problem and the annealing process, the states of the solid represent feasible solutions of the optimization problem, the energies of the states correspond to the values of the objective function computed at those solutions, the minimum energy state corresponds to the optimal solution to the problem, and rapid quenching can be viewed as local optimization. The algorithm consists of a sequence of iterations. Each iteration consists of randomly changing the current solution to create a new solution in its neighborhood; the neighborhood is defined by the choice of the generation mechanism. Once a new solution is created, the corresponding change in the cost function is computed to decide whether the newly produced solution can be accepted as the current solution. If the change in the cost function is negative, the newly produced solution is directly taken as the current solution. Otherwise, it is accepted according to the Metropolis criterion [Metropolis et al., 1953] based on Boltzmann's probability [5]. A minimal sketch of this acceptance rule follows.
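The sketch below applies the acceptance rule to a hypothetical one-variable cost; the initial temperature and the geometric cooling schedule are assumed for illustration.

% Minimal simulated annealing sketch (hypothetical cost and cooling schedule)
f = @(x) (x - 3).^2;
x = 0;                          % current solution
T = 1.0;                        % initial temperature (assumed)
for it = 1:1000
    xNew = x + randn;           % random neighbour of the current solution
    d = f(xNew) - f(x);         % change in the cost function
    % Metropolis criterion: always accept improvements; accept
    % deteriorations with probability exp(-d/T)
    if d < 0 || rand < exp(-d/T)
        x = xNew;
    end
    T = 0.995*T;                % slow geometric cooling, not rapid quenching
end
x                               % should settle near x = 3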

1.2.5 Ant Colony Optimization (ACO)

ACO was inspired by the behavior of ants in their natural habitat. A colony of ants is able to find the shortest path between the nest and a food source by depositing a trail of a chemical substance, called pheromone, on the ground as they move. This pheromone can be observed by other ants and motivates them to follow the path with high probability. The optimization technique is based on the indirect communication of a colony of simple agents, called (artificial) ants, mediated by (artificial) pheromone trails, which serve as distributed numerical information that the ants use to probabilistically construct solutions to the problem. This information is adapted by the ants during the algorithm's execution to reflect their search experience. In this way, the best solution accumulates more pheromone and has a higher probability of being chosen. The described behavior of real ant colonies can be used to solve combinatorial optimization problems by simulation, using artificial ants that search the solution space by transiting from node to node. The artificial ants' moves are usually associated with their previous actions, stored in memory with a specific data structure. The pheromone levels of all paths are updated only after an ant finishes its tour from the first node to the last node. Every artificial ant has a constant amount of pheromone stored in it when it proceeds from the first node, and this pheromone is distributed evenly along the path after the ant finishes its tour; the quantity of pheromone on a path will therefore be high if artificial ants finished their tours with a good path. The pheromone on the routes progressively decreases by evaporation in order to avoid artificial ants getting stuck in local optima [8].

1.2.6 Neural Networks

Neural networks are modeled on the mechanism of the brain. Theoretically, they have a parallel distributed information processing structure. Two of the major features of neural networks are their ability to learn from examples and their tolerance to noise and damage to their components. A neural network consists of a number of simple processing elements, also called nodes, units, short-term memory elements or neurons. These elements are modeled on the biological neuron and perform local information processing operations. A processing element has several inputs and one output; the inputs could be its own output, the outputs of other processing elements or input signals from external devices. Processing elements are connected to one another through links with weights, which represent the strengths of the connections. The weight of a link determines the effect of the output of one neuron on another neuron, and can be considered part of the long-term memory in a neural network. After the inputs are received by a neuron, a pre-processing operation is applied. The output of the pre-processing operation is passed through a function called the activation function to produce the final output of the processing element. Depending on the problem, various types of activation functions are employed, such as a linear function, step function, sigmoid function, hyperbolic-tangent function, etc. [7].

1.2.7 Hybrid Methods

These are combinations of two or more optimization methods, with the aim of taking advantage of the pros of each method in the mix while reducing computation time, hence speeding up convergence and/or improving the quality of the solution. An example is Expert System SA (ESSA), which uses an expert system consisting of several heuristic rules to find a local optimal solution that is then employed as the initial starting point of the second stage. This method is insensitive to the initial starting point, so the quality of the solution is stable, and it can deal with a mixture of continuous and discrete variables [9].

1.3 PROBLEM STATEMENT

In order to obtain an accurate cost function, the reactive power cost is to be included in the active power cost function. The total cost is given by combining the active and reactive power costs, giving the active power more weight than the reactive power. The objective function becomes:

Minimize F_Total = Σ_{i=1}^{NG} [ W·F(P_gi) + (1 − W)·F(Q_gi) ]

Subject to:

Σ_{i=1}^{NG} P_gi − Σ_{i=1}^{NB} P_Di − P_L = 0

Σ_{i=1}^{NG} Q_gi − Σ_{i=1}^{NB} Q_Di − Q_L = 0

P_gi^min ≤ P_gi ≤ P_gi^max   (i = 1, 2, ..., NG)

Q_gi^min ≤ Q_gi ≤ Q_gi^max   (i = 1, 2, ..., NG)

where:
P_gi, Q_gi are the active/real and reactive generations of the i-th generator
P_Di, Q_Di are the active/real and reactive power demands
P_L, Q_L are the active/real and reactive power transmission losses
NB is the number of buses
NG is the number of generators

For this project, W is taken to be 80% and consequently (1 − W) becomes 20%. Therefore, the combined objective function becomes:

Minimize F_Total = Σ_{i=1}^{NG} [ 0.8·F(P_gi) + 0.2·F(Q_gi) ]

1.4 JUSTIFICATION

While most optimization and soft computing techniques provide solutions for static optimization tasks, Reinforcement Learning based strategies can easily provide solutions for dynamic optimization problems. This makes Reinforcement Learning a good learning strategy suitable for real-time control tasks and many optimization problems. In the case of RL based solution strategies, the environment need not be mathematically well defined; the agent can acquire knowledge, or learn, in a model-free environment. By acquiring knowledge of the rewards or punishments for an action taken in a given environment or state of the system, the learning strategy improves its performance step by step. Through a simple learning procedure with a sufficient number of iterative steps, the agent can learn the best actions in any situation or state of the system. Also, the reward or return function need not be deterministic, since at each action step the agent can accept the reward from a dynamic environment.
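To make the weighted objective of Section 1.3 concrete, the MATLAB sketch below evaluates the combined cost F_Total for a candidate two-generator dispatch with W = 0.8. The cost coefficients and dispatch values here are hypothetical; the data actually used in this project appear in the Appendix.

% Combined real and reactive cost of Section 1.3 with W = 0.8
% (all coefficients and the candidate dispatch are illustrative assumptions)
a  = [0.0050; 0.0060]; b  = [2.0; 1.8]; c  = [100; 120];  % real power cost coefficients
aq = [0.0010; 0.0012]; bq = [0.5; 0.4]; cq = [ 20;  25];  % reactive power cost coefficients
Pg = [400; 380];       Qg = [180; 170];                   % candidate dispatch (MW, MVAr)
W  = 0.8;
FP = a.*Pg.^2 + b.*Pg + c;        % F(P_gi) for each generator
FQ = aq.*Qg.^2 + bq.*Qg + cq;     % F(Q_gi) for each generator
Ftotal = sum(W*FP + (1 - W)*FQ)   % combined objective to be minimized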

1.5 ORGANIZATION OF THE REPORT

This project has been organized into five chapters as follows. In Chapter 1, the ED problem is introduced; other optimization methods that can be used in solving the problem are discussed, together with the problem statement and project objectives. In Chapter 2, a literature review of real and reactive dispatch of power is presented, along with a detailed literature review of reinforcement learning. In Chapter 3, the implementation of combined real and reactive dispatch of power using RL is discussed in detail; the RL algorithm for solving economic dispatch is presented and its flowchart drawn. In Chapter 4, the simulation results obtained from programming in MATLAB are analyzed and discussed. In Chapter 5, conclusions are presented and recommendations for further work stated.

2 LITERATURE REVIEW

2.1 LITERATURE REVIEW ON REAL POWER ECONOMIC DISPATCH

Let us assume that it is known a priori which generators are to be run to meet a particular load demand on the station. Suppose there is a station with NG generators committed and the active power load demand P_D is given. The real power generation P_gi for each generator has to be allocated so as to minimize the total cost. The optimization problem can therefore be stated as:

Minimize F(P_gi) = Σ_{i=1}^{NG} F_i(P_gi)   (2.1a)

Subject to:

i) the energy balance equation:

Σ_{i=1}^{NG} P_gi = P_D   (2.1b)

ii) the inequality constraints:

P_gi^min ≤ P_gi ≤ P_gi^max   (i = 1, 2, ..., NG)   (2.1c)

where:
P_gi is the decision variable, that is, real power generation
P_D is the real power demand
NG is the number of generation plants
P_gi^min is the lower permissible limit of real power generation
P_gi^max is the upper permissible limit of real power generation
F_i(P_gi) is the operating fuel cost of the i-th plant, given by the quadratic equation:

F_i(P_gi) = a_i·P_gi² + b_i·P_gi + c_i  Ksh./hour   (2.1d)

The above constrained optimization problem is converted into an unconstrained optimization problem. The Lagrange multiplier method is used, in which a function is minimized (or maximized) with side conditions in the form of equality constraints. Using this method, an augmented function is defined as:

L(P_gi, λ) = F(P_gi) + λ(P_D − Σ_{i=1}^{NG} P_gi)   (2.2)

where λ is the Lagrangian multiplier. A necessary condition for a function F(P_gi), subject to the energy balance constraint, to have a relative minimum at a point P_gi is that the partial derivative of the Lagrange function L = L(P_gi, λ) with respect to each of its arguments must be zero. So, the necessary conditions for the optimization problem are:

∂L(P_gi, λ)/∂P_gi = ∂F(P_gi)/∂P_gi − λ = 0   (i = 1, 2, ..., NG)   (2.3)

and

∂L(P_gi, λ)/∂λ = P_D − Σ_{i=1}^{NG} P_gi = 0   (2.4)

From equation (2.3),

∂F(P_gi)/∂P_gi = λ   (i = 1, 2, ..., NG)   (2.5)

where ∂F(P_gi)/∂P_gi is the incremental fuel cost of the i-th generator ($/MWh). Optimal loading of the generators corresponds to the equal incremental cost point of all the generators. Equations (2.5), called the coordination equations and numbering NG, are solved simultaneously with the load demand to yield a solution for the Lagrange multiplier λ and the optimal generation of the NG generators. Considering the cost function given by equation (2.1d), the incremental cost can be defined as:

∂F(P_gi)/∂P_gi = 2·a_i·P_gi + b_i   (2.6)

Substituting the incremental cost into equation (2.5), this equation becomes:

2·a_i·P_gi + b_i = λ   (i = 1, 2, ..., NG)   (2.7)

Rearranging equation (2.7) to get P_gi:

P_gi = (λ − b_i) / (2·a_i)   (i = 1, 2, ..., NG)   (2.8)

Substituting the value of P_gi into equation (2.4), we get:

λ = (P_D + Σ_{i=1}^{NG} b_i/(2·a_i)) / (Σ_{i=1}^{NG} 1/(2·a_i))   (2.9)

Thus, λ can be calculated using equation (2.9) and P_gi can be calculated using equation (2.8). Now consider the effect of the generator limits given by the inequality constraint of equation (2.1c). If a particular generator loading P_gi reaches the limit P_gi^min or P_gi^max, its loading is held fixed at this value and the balance load is shared between the remaining generators on an equal incremental cost basis [10].

Limit Constraint Fixing

To fix up the limits, the following strategy can be applied. Let

h = Σ_{i=1}^{R1} h_i^max − Σ_{i=1}^{R2} h_i^min   (2.10)

where
h_i^max = P_gi − P_gi^max   (i = 1, 2, ..., R1 upper bound violations)
h_i^min = P_gi^min − P_gi   (i = 1, 2, ..., R2 lower bound violations)

i) If h > 0, fix all R1 upper bound violations to their upper limits, i.e. P_gi^max.
ii) If h < 0, fix all R2 lower bound violations to their lower limits, i.e. P_gi^min.
iii) If h = 0, fix both the R1 upper and R2 lower bound violations to their respective upper P_gi^max and lower P_gi^min limits.

The new demand is then determined as the original P_D minus the sum of the fixed generation levels, i.e.

P_D^new = P_D − Σ_{i=1}^{R1+R2} P_gi

and this new demand is allocated to the other committed generators on an equal incremental cost basis.
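The coordination equations and the limit-fixing strategy translate directly into code. The MATLAB sketch below computes λ from equation (2.9), dispatches each unit by equation (2.8), and clamps any violated unit before re-allocating the remaining demand; the three-unit cost data and the demand are illustrative assumptions, not the project's IEEE 14-bus data.

% Equal incremental cost dispatch per equations (2.8)-(2.9), with
% generator limits handled by fixing violations and re-dispatching.
% Cost data and demand below are illustrative assumptions.
a = [0.008; 0.009; 0.007]; b = [7.0; 6.3; 6.8];    % F_i = a_i*P^2 + b_i*P + c_i
Pmin = [100; 100; 100]; Pmax = [400; 350; 450];    % generator limits, MW
PD = 975;                                          % demand to allocate, MW
free = true(3,1); Pg = zeros(3,1);
while any(free)
    % lambda over the still-free units, equation (2.9)
    lambda = (PD + sum(b(free)./(2*a(free)))) / sum(1./(2*a(free)));
    Pg(free) = (lambda - b(free)) ./ (2*a(free));  % equation (2.8)
    over  = free & (Pg > Pmax);
    under = free & (Pg < Pmin);
    if ~any(over | under), break; end              % all limits satisfied
    Pg(over) = Pmax(over); Pg(under) = Pmin(under);% fix the violations
    PD = PD - sum(Pg(over | under));               % demand left for free units
    free = free & ~(over | under);
end
disp([Pg; lambda])    % optimal schedule and the incremental cost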

2.1.1 Real Dispatch of Power Objective Function

The economic dispatch problem is defined as that which minimizes the total operating cost of a power system while meeting the total load plus transmission losses within generator limits. Mathematically, the problem is defined as:

Minimize F(P_gi) = Σ_{i=1}^{NG} (a_i·P_gi² + b_i·P_gi + c_i)  $/h   (2.11a)

Subject to:

i) the energy balance equation:

Σ_{i=1}^{NG} P_gi = P_D + P_L   (2.11b)

ii) and the inequality constraints:

P_gi^min ≤ P_gi ≤ P_gi^max   (i = 1, 2, ..., NG)

where
a_i, b_i, c_i are the cost coefficients
P_D is the load demand
P_gi is the real power generation and acts as the decision variable
NG is the number of generation buses
P_L is the transmission power loss

2.2 LITERATURE REVIEW ON REACTIVE POWER ECONOMIC DISPATCH

The majority of reactive power planning (RPP) objectives have been to provide the least cost of new reactive power supplies. Many variants of this objective include the cost of real power losses or the fuel cost. In addition, technical indices such as deviation from a given voltage schedule or the security margin may be used as objectives for optimization [11].

2.2.1 Minimize Var Cost

Generally, there are two Var source cost models for minimization. The first formulation models Var source costs as C_1·Q_c, a linear function with no fixed cost. This model considers only the variable cost relevant to the rating of the newly installed Var source Q_c and ignores the fixed installation cost.

The common unit for C_1 is $/(MVar.hour). A better formulation, of the form (C_0 + C_1·Q_c)·x, also considers the fixed cost C_0 ($/hour), which is the lifetime fixed cost prorated per hour, in addition to the incremental/variable cost C_1 ($/MVar.hour).

Minimize Var Cost and Real Power Losses

This objective may be divided into two groups:

i) minimize C_1(Q_c) + C_2(P_loss)
ii) minimize (C_0 + C_1·Q_c)·x + C_2(P_loss)

where C_2(P_loss) represents the cost of real power loss and C_0 is the fixed cost. The objective can be written as follows:

min F = Σ_{k=0}^{Nc} [ C_1(Q_c) + C_2(P_loss) ]_k

where k (= 0, 1, ..., L, ..., Nc) represents the k-th operating case. Considered here are the base case (k = 0), the contingency cases under preventive mode (k = 1, ..., L), and the contingency cases under corrective mode (k = L+1, ..., Nc).

Minimize Var Cost and Generator Fuel Cost

As an alternative to the cost of real power loss, the fuel cost is adopted as a direct measure of the operation cost. The minimization of real power loss cannot in general guarantee the minimization of the total fuel cost; instead, minimization of the total fuel cost already includes the cost reduction due to the minimization of real power loss. This objective consists of the sum of the costs of the individual generating units:

C_T = Σ_{i=1}^{n} F_i(P_gi)

where F_i(P_gi) = a_0i + a_1i·P_gi + a_2i·P_gi² is the common generator cost-versus-MW curve, approximately modeled as a quadratic function, and a_0i, a_1i, a_2i are cost coefficients.

2.2.2 Minimum Deviation From a Specific Point

This objective is usually defined as the weighted sum of the deviations of the control variables, such as bus voltages, from their given target values. The target values correspond to the initial or specified operating points. Voltage deviation is minimized as Σ_i (V_imax − V_i), where the subscript i represents the different buses for voltage regulation.
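The difference between the two Var source cost models of Section 2.2.1 is easy to see numerically; the short MATLAB fragment below compares them for a hypothetical candidate Var source, with all values assumed for illustration.

% Comparison of the two Var source cost models (illustrative values only)
C0 = 0.22;          % fixed cost prorated per hour, $/h (assumed)
C1 = 0.90;          % variable cost, $/(MVar.hour) (assumed)
Qc = 25;            % rating of the candidate Var source, MVar (assumed)
x  = 1;             % binary decision: 1 if the source is installed
costLinear = C1*Qc              % variable-cost-only model, C1*Qc
costFull   = (C0 + C1*Qc)*x     % model that also carries the fixed cost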

2.2.3 Voltage Stability Related Objectives

The main function of shunt reactive power compensation is to provide voltage support to avoid voltage instability or a large-scale voltage collapse. As shown in Figure 2-1, voltage stability is usually represented by a P-V (or S-V) curve.

Figure 2-1: Voltage Stability Curve

The nose point of the P-V curve is called the point of collapse (PoC), where the voltage drops rapidly with an increase of load. The PoC is also known as the equilibrium point, where the corresponding Jacobian becomes singular. Hence, the power-flow solution fails to converge beyond this limit, which indicates voltage instability and can be associated with a saddle-node bifurcation point. These instabilities are usually local-area voltage problems due to a lack of reactive power. Therefore, one objective can be to increase the static voltage stability margin (SM), defined as the distance between the saddle-node bifurcation point and the base case operating point. SM can be expressed as:

SM = Σ_i S_i^critical − Σ_i S_i^normal

where S_i^normal and S_i^critical are the MVA loads of load bus i at the normal operating state B and at the voltage collapse critical state (PoC) A, as shown in Figure 2-1, respectively. One can expect an improvement in the stability of the system for that operating point.
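Under this definition, the margin follows directly from the bus loadings at the operating point B and at the point of collapse A; a small MATLAB sketch with hypothetical loadings is shown below.

% Static voltage stability margin from normal and critical MVA loads
% (the three bus loadings below are illustrative assumptions)
Snormal   = [60; 45; 80];      % MVA loads of the load buses at state B
Scritical = [95; 70; 120];     % MVA loads at the point of collapse A
SM = sum(Scritical) - sum(Snormal)   % stability margin in MVA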

2.2.4 Multi-Objective (MO)

This objective includes Var investment cost minimization, power loss reduction and voltage deviation. One MO formulation is:

Min F = 10·(voltage violation in p.u.) + (generator Var violation in p.u.)² + power losses in p.u.

Another MO formulation is:

Min F = (C_0 + C_1·Q_c)·x + C_2(P_loss) + ρ_1·Σ_i ((V_i − V_ispec)/V_imax)² + ρ_2·Σ_l ((S_l − S_lspec)/S_lmax)²

where
V_i = voltage magnitude at bus i
V_ispec = specified voltage magnitude at bus i
V_imax = maximum allowable voltage deviation limit at bus i
S_l = MVA flow through line l
S_lspec = MVA capacity limit of line l
S_lmax = specified allowable line flow deviation limit
ρ_1 and ρ_2 are weights for the different objectives

2.2.5 Reactive Power Dispatch and Voltage Control

Reactive power and voltage control have a significant influence on the security of a power system. For efficient and reliable operation of power systems, the voltages at the terminals of all equipment in the system must be maintained within desired limits for power system stability enhancement. Conventionally, minimization of total transmission line losses has been considered the main objective in reactive power dispatch; more recently, the trend has been towards the elimination of security constraint violations. Proper redistribution of reactive power generation offers the following benefits:

- Reduction in real power transmission losses caused by unnecessary reactive power flows, which consequently results in the lowest production cost.
- Increase in system security from augmented reactive power reserves for emergencies.

The reactive power dispatch objective thus seeks to minimize the active power losses in the network.

2.2.6 Reactive Dispatch of Power Objective Function

Reactive power production cost is highly dependent on real power output. If a generator produces its maximum active power (P_max), then no reactive power is produced and the apparent power S equals P_max. However, reactive power production by a generator reduces its capability to produce active power; hence the production of reactive power by a generator results in a reduction of its active power production. So, to generate reactive power Q_gi with generator i, which has been operating at its nominal power P_max, it is required to reduce its active power to P_gi (Hasanpour et al., 2009).

At different values of Q_gi with respect to P_gi, the quadratic cost expression for reactive power is obtained by fitting a curve to a quadratic polynomial. The fuel cost in terms of reactive power output can be expressed as:

F(Q_gi) = Σ_{i=1}^{NG} (a_qi·Q_gi² + b_qi·Q_gi + c_qi)

where a_qi, b_qi, c_qi are reactive power cost coefficients, calculated using curve fitting, and NG is the number of generators. This objective function is very simple and, since it is extracted from the power cost function of the generator, it is more realistic and can provide accurate results in reactive power pricing [12].

2.3 LITERATURE REVIEW ON REINFORCEMENT LEARNING

Reinforcement learning (RL) refers to a class of learning algorithms in which a learning system learns which action to take in different situations by using a scalar evaluation received from the environment on performing an action. RL has been successfully applied to many multi-stage decision making problems (MDPs), in which at each stage the learning system decides which action to take. The economic dispatch (ED) problem is an important scheduling problem in power systems, which decides the amount of generation to be allocated to each generating unit so that the total cost of generation is minimized without violating system constraints. In this project, the economic dispatch problem is formulated as a multi-stage decision making problem, and an RL based algorithm is developed to solve it. The main advantage of RL is that it can learn the schedule for all possible demands simultaneously.

2.3.1 Background Information on Reinforcement Learning

Reinforcement Learning (RL) is the study of how animals and artificial systems can learn to optimize their behavior in the face of rewards and punishments. One way in which animals acquire complex behaviors is by learning to obtain rewards and to avoid punishments: a baby learning to walk, a child learning to ride a bicycle, an animal learning to trap its food, and so on. During this learning process, the agent interacts with the environment. At each step of interaction, on observing the current state, an action is taken by the learner. Depending on the goodness of the action in the particular situation, it is tried in the next stage when the same or a similar situation arises (Bertsekas and Tsitsiklis [1996], Sutton and Barto [1998], Sathyakeerthi and Ravindran [1996]).

The learning methodologies developed for such learning tasks originally combine two disciplines: Dynamic Programming and Function Approximation (Moore et al. [1996]). Dynamic Programming is a field of mathematics that has traditionally been used to solve a variety of optimization problems; however, Dynamic Programming in its pure form is limited in the size and complexity of the problems it can address. Function Approximation methods such as Neural Networks learn the system from different sets of input-output pairs used to train the network. In RL, the goal to be achieved is known, and the system learns how to achieve the goal by trial-and-error interactions with the environment.

In the conventional Reinforcement Learning framework, the agent does not initially know what effects its actions have on the state of the environment, nor what immediate reward it will get on selecting an action. In particular, it does not know what action is best to take. Rather, it tries out the various actions at various states and gradually learns which one is best at each state so as to maximize its long-term reward. The agent thus tries to acquire a control policy, or rule, for choosing an action according to the observed current state of the environment.

The most natural way to acquire this control rule would be for the agent to visit each and every state in the environment and try out the various possible actions. At each state it observes the effect of the actions in terms of rewards, and from the observed rewards the best action at each state, or best policy, is deduced. However, this is not practically possible, since planning ahead involves accurate enumeration of possible actions and rewards at various states, which is computationally very expensive. Such planning is also very difficult since some actions may have stochastic effects, so that performing the same action in two different situations may give different reward values. One promising feature of such Reinforcement Learning problems is that there are simple learning algorithms by means of which an agent can learn an optimal rule or policy without the need for planning ahead. Moreover, such learning requires only a minimal amount of memory: an agent can learn if it can consider only the last action it took, the state in which it took that action and the present state reached.

The concept of the Reinforcement Learning problem and action selection is explained with a simple N-arm bandit problem in the next section. A grid world problem is taken to discuss the different parts of the RL problem. Then the multi-stage decision making tasks are explained. The various techniques of solution, or learning, are described through mathematical formulations. The different action selection strategies and one of the solution methods, namely Q-learning, are discussed. A few applications of RL based learning in the field of power systems are also briefly explained [13].

2.3.2 N-Arm Bandit Problem

The N-arm bandit is a game based on slot machines. The slot machine has a number of arms or levers. To play the game, one pays a fixed fee. The player obtains a monetary reward by playing an arm of his choice; the reward may be greater or less than the fee he has paid. The reward from each arm varies around a mean value with some variance. The aim of the player is to obtain the maximum reward by playing the game. If the play on an arm is considered as an action or decision, then the objective is to find the best action from the action set (the set of arms). Since the reward varies around a mean value, the problem is to find the action giving the highest reward, i.e. the arm with the highest mean value, which can be called the best arm.

To introduce the notation used here, the action of choosing an arm is denoted by a. The goodness of choosing an arm, or the quality of an arm, is the mean reward of the arm and is denoted by Q(a). If the means of all arms are known, the best arm is given by:

a* = argmax_a Q(a)   (2.3.1)

As mentioned earlier, the problem is that the Q(a) values are unknown. One simple and direct method is to play each arm a large number of times. Let the reward received on playing an arm in the k-th trial be r_k(a). Then an estimate of Q(a) after n trials, denoted Q_n(a), is obtained using:

Q_n(a) = (1/n) Σ_{k=1}^{n} r_k(a)   (2.3.2)

By the law of large numbers, Q_n(a) → Q(a) as n → ∞. The optimal action is then obtained by equation (2.3.1). The above method, termed brute force, is time consuming. As a preliminary to understanding an efficient algorithm for finding the Q values (the mean values corresponding to each arm), a well-known recursive method is now derived.

As explained earlier, the average based on n observations is given by:

Q_n(a) = (1/n) Σ_{k=1}^{n} r_k(a)   (2.3.3)

Therefore,

(n+1)·Q_{n+1}(a) = Σ_{k=1}^{n+1} r_k(a) = n·Q_n(a) + r_{n+1}(a)

Then, using equation (2.3.3), that is,

Q_{n+1}(a) = Q_n(a) + (1/(n+1))·[ r_{n+1}(a) − Q_n(a) ]   (2.3.4)

This equation says that the new estimate based on the (n+1)-th observation r_{n+1}(a) is the old estimate Q_n(a) plus a small number times the error, {r_{n+1}(a) − Q_n(a)}.
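Equation (2.3.4) is simply the running average computed incrementally; the MATLAB fragment below confirms that the recursive estimate matches the batch average of equation (2.3.2) for a stream of hypothetical rewards.

% Recursive mean of equation (2.3.4) versus the batch average (2.3.2)
r = 5 + randn(1, 1000);              % hypothetical rewards with mean 5
Q = 0;                               % initial estimate
for n = 0:numel(r)-1
    Q = Q + (1/(n+1)) * (r(n+1) - Q);   % new = old + step * error
end
[Q, mean(r)]                         % the two estimates agree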

There are results which say that, under some technical conditions, a decreasing sequence {α_n} can be used instead of 1/(n+1) to get a recursive equation:

Q_{n+1}(a) = Q_n(a) + α_n·[ r_{n+1}(a) − Q_n(a) ]

The sequence α_n is such that

Σ_n α_n = ∞  and  Σ_n α_n² < ∞

Now an efficient method to find the best arm of the N-arm bandit problem can be stated:

Step 1: Initialize n = 0, α = 0.1
Step 2: Initialize Q_0(a) = 0 for all a in A
Step 3: Select an action a using an action selection strategy
Step 4: Play the arm corresponding to action a and obtain the reward r_n(a)
Step 5: Update the estimate of Q(a): Q(a) = Q(a) + α·[ r_n(a) − Q(a) ]
Step 6: n = n + 1
Step 7: If n < max_iteration, go to Step 3
Step 8: Stop

To use the above algorithm, an efficient action selection strategy is required. One method would be to take an action with uniform probability; in this way one plays all the arms an equal number of times, so that throughout the learning the action space is explored. Instead of playing all the arms this often, it makes sense to play more often the arms which may be the best arm. One such efficient algorithm for action selection is ε-greedy. In this algorithm, the greedy arm is played with probability (1 − ε) and one of the other arms with probability ε. The greedy arm a_g corresponds to the arm with the best estimate of the Q value, that is:

a_g = argmax_a Q_n(a)

It may be noted that if ε = 1 the algorithm selects one of the actions with uniform probability, and if ε = 0 the greedy action is selected. Initially, the estimates Q_n(a) may not be the true values; however, as n → ∞, Q_n(a) → Q(a), and then we may exploit the information contained in Q_n(a). So in the ε-greedy algorithm, ε is initially chosen close to 1 and, as n increases, ε is gradually reduced. A minimal sketch of this procedure is given below.
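The eight steps, combined with ε-greedy action selection, can be put together as follows; the arm means, the constant step size and the ε schedule are illustrative assumptions.

% epsilon-greedy learning on a hypothetical 4-arm bandit
Qtrue = [1.0 2.5 1.8 2.2];           % true (unknown) mean reward of each arm
Q = zeros(1,4);                       % estimates, Step 2
alpha = 0.1; epsilon = 1.0;           % step size and initial epsilon, Step 1
for n = 1:5000
    if rand < epsilon
        a = randi(4);                 % explore: an arm chosen uniformly
    else
        [~, a] = max(Q);              % exploit: the greedy arm a_g
    end
    r = Qtrue(a) + randn;             % play the arm, observe the reward, Step 4
    Q(a) = Q(a) + alpha*(r - Q(a));   % update the estimate, Step 5
    epsilon = max(0.01, epsilon*0.999);  % gradually reduce epsilon
end
Q                                     % estimates; arm 2 should look best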

Proper balancing of exploration and exploitation of the action space ultimately reduces the number of trials needed to find the best arm. A more detailed discussion of the parts of a Reinforcement Learning problem is given in the following sections.

2.3.3 Parts of Reinforcement Learning

The earlier example had only one state. In many practical situations, the problem is to find the best action for each of several different states. In order to make the characteristics of such general Reinforcement Learning problems clearer, and to identify the different parts of a Reinforcement Learning problem, a shortest path problem is considered in this section. Consider the grid world problem given in Figure 2-2.

Figure 2-2: Grid World Problem

The grid has 36 cells arranged in 6 rows and 6 columns. A robot can be in any one of the cells at any instant. G denotes the goal state which the robot aims to reach, and the crossed cells denote cells with some sort of obstacle. There is a cost associated with each cell transition, and the cost of passing through a cell with an obstacle is much higher than for other cells. Starting from any initial position in the grid, the robot can reach the goal cell by following different paths, and correspondingly the cost incurred will also vary.

The problem is to find an optimum path to reach the goal starting from any initial cell position. With respect to this example, the parts of the Reinforcement Learning problem can now be defined.

State Space

The cell number can be taken as the state of the robot at any time. The possible states the robot can occupy come from the entire cell space; in Reinforcement Learning terminology, this is the state space. The state space in a Reinforcement Learning problem is defined as the set of possible states the agent (learner) can occupy at different instants of time. At any instant, the agent is in one of the states from the entire state space. The state of the robot at instant k is denoted x_k, and the entire state space is denoted χ, so that at any instant k, x_k ∈ χ. In order to reach the goal state G from the initial state x_0, the robot has to take a series of actions or cell transitions a_0, a_1, ..., a_n.

Action Space

At any instant k, the robot can take an action (cell transition) a_k from the set of permissible actions, the action set or action space A. The permissible set of actions at each instant k depends on the current state x_k of the robot. If the robot is in any of the cells in the first column, "move left" is not possible. Similarly, for each cell in the grid world there is a set of possible cell movements or state transitions. The set of possible actions or cell transitions at the current state x_k is denoted A_xk, which depends on the current state x_k. For example, if x_k = 7, A_xk = {right, up, down}, and if x_k = 1, A_xk = {right, down}.

System Model

Reinforcement Learning can be used to learn directly by interacting with the system. If that is not possible, a model is required. It need not be a mathematical model; a simulation model is sufficient. In this simple example, a mathematical model can be obtained. On taking an action, the robot proceeds to the next cell position, which is a function of the current state and action. In other words, the state occupied by the robot at instant k+1, x_{k+1}, depends on x_k and a_k. That is,

x_{k+1} = f(x_k, a_k)   (2.3.5)

For example, if x_k = 7 and a_k = down, then x_{k+1} = 13, while if a_k = up, x_{k+1} = 1. For this simple grid world, x_{k+1} is easily obtained by observation. For problems with a larger state space, the state x_{k+1} can be found from a simulation model or by studying the environment in which the robot moves. The aim of the robot in the grid is to reach the goal state, starting from its initial position or state, at minimum cost. At each step it takes an action, which is followed by a state transition or movement in the grid.
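For the grid world, the model of equation (2.3.5) is a simple lookup. The MATLAB sketch below implements x_{k+1} = f(x_k, a_k) for the 6-by-6 grid, with cells numbered row-wise from 1 so that, as in the text, moving down from cell 7 gives cell 13 and moving up gives cell 1; the function name is a hypothetical choice.

% Transition model x(k+1) = f(x(k), a(k)) for the 6x6 grid world,
% cells numbered 1..36 row-wise; actions: 'up','down','left','right'.
% (save as gridStep.m)
function xNext = gridStep(x, a)
    [row, col] = deal(ceil(x/6), mod(x-1, 6) + 1);  % cell -> (row, col)
    switch a
        case 'up',    if row > 1, row = row - 1; end
        case 'down',  if row < 6, row = row + 1; end
        case 'left',  if col > 1, col = col - 1; end
        case 'right', if col < 6, col = col + 1; end
    end
    xNext = (row - 1)*6 + col;   % e.g. gridStep(7,'down') returns 13
end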

The actions whose state transitions reach the goal state at minimum cost constitute the optimum solution. Therefore, the shortest path problem can be stated as finding the sequence of actions a_0, a_1, ..., a_{N-1}, starting from any initial state, such that the total cost of reaching the goal state G is minimum.

Policy

As explained in the previous section, whenever an action a_k is taken in state x_k, a state transition occurs, governed by equation (2.3.5). The ultimate learning solution is to find a rule by which an action is chosen at any of the possible states; in other words, a good mapping from the state space χ to the action space A is to be derived. In Reinforcement Learning problems, any mapping from the state space to the action space is termed a policy and denoted π, so that π(x) denotes the action taken by the robot on reaching state x. At any state x, since there are different possible paths to reach the goal, they are treated as different policies: π_1(x), π_2(x), etc. The optimum policy at any state x is denoted π*(x). Reinforcement Learning methods go through iterative steps to evolve this optimal policy π*(x). In order to find the optimum policy, some means of comparing policies has to be formulated; for this, a reward function is defined which gives a quantitative measure of the goodness of an action at a particular state.

Reinforcement Function

Designing a reinforcement function is an important issue in Reinforcement Learning. The reinforcement function should capture the objective of the agent. In some cases this is straightforward; in others it is not. For example, in the case of the N-arm bandit problem (which can be viewed as a Reinforcement Learning problem with just one state), the reinforcement function is the return obtained when the agent plays an arm. In the case of the grid world problem, the objective is to find the shortest path. Here, it can be assumed that the system incurs a cost of one unit when the agent moves from one cell to another normal cell, and a cost of B units when it moves to a cell with an obstacle. The value B should be chosen depending on how bad the obstacle is. More formally, at stage k the agent performs an action a_k in state x_k and moves to a new state x_{k+1}. The reinforcement function is denoted g(x_k, a_k, x_{k+1}); the reinforcement obtained in each step is also known as the reward and is denoted r_k. The agent learns a sequence of actions to minimize the accumulated g(x_k, a_k, x_{k+1}). In the case of learning by animals, the reward is obtained from the environment; in the case of algorithms, the reinforcement function has to be defined. In this simple grid world, the reinforcement function can be defined as:

g(x_k, a_k, x_{k+1}) = 1 if x_{k+1} is a normal cell, and g(x_k, a_k, x_{k+1}) = B if x_{k+1} is a cell with an obstacle.

If a cell with an obstacle has to be avoided, choose B = 1,000,000; if the obstacle has only a very small effect, then B can be chosen as 10. To find the total cost, the costs or rewards on each transition are accumulated. The total cost for reaching the goal state can be taken as

Σ_{k=0}^{N−1} g(x_k, a_k, x_{k+1})

x_0 being the initial state and N the number of transitions needed to reach the goal state.

Value Function

The issue is how the robot (in general, the agent in a Reinforcement Learning problem) can choose good decisions in order to reach the goal state, starting from an initial state x, at minimum cost. The robot has to follow a good policy from the initial state in order to reach the goal at minimum cost. One measure with which to evaluate the goodness of a policy is the total expected discounted cost incurred while following that policy over N stages. A value function V^π : χ → R is defined to rate the goodness of the different policies. V^π(x) represents the total cost incurred by starting in state x and following policy π over N stages:

V^π(x) = E[ Σ_{k=0}^{N−1} γ^k g(x_k, π(x_k), x_{k+1}) ],  x_0 = x   (2.3.6)

Here γ is the discount factor. The reason for incorporating a discount factor is that the real goodness of an action may not be reflected by its immediate reward. The value of γ is decided by the problem environment, to account for how much the future rewards are to be discounted when rating the goodness of the policy at the present state. The discount factor can take a value between 0 and 1 depending on the problem environment; a value of 1 indicates that all future rewards have the same importance as the immediate reward. In this shortest path problem, since all the costs are relevant to the same extent, γ is taken as 1. With this objective function, a policy π_1 is said to be better than a policy π_2 when V^π1(x) ≤ V^π2(x) for all x ∈ χ. The problem is to find an optimal policy π* such that, starting from an initial state x, the value function, or expected total cost, is lower when following policy π* than when following any other policy π ∈ Π. That is, find π* such that

V^π*(x) ≤ V^π(x)  for all π ∈ Π and all x ∈ χ

Π being the set of policies.
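The value function (2.3.6) can be estimated for any fixed policy simply by following it from x_0 and accumulating the discounted step costs. The sketch below does this with the hypothetical gridStep model given earlier, using γ = 1 as chosen for the shortest path problem; for this deterministic grid a single rollout equals the expectation in (2.3.6). The function name and the cell-array representation of the policy are illustrative assumptions.

% Rollout evaluation of the value function (2.3.6) for a fixed policy.
% pol is a 36-element cell array of action names, one per state;
% obstacle is a logical 36-element vector marking the crossed cells.
% (save as policyValue.m; uses the hypothetical gridStep.m above)
function V = policyValue(x0, pol, obstacle, B, gamma, goal)
    V = 0; x = x0; k = 0;
    while x ~= goal && k < 1000                    % cap the rollout length
        a = pol{x};                                % action prescribed at state x
        xNext = gridStep(x, a);                    % transition, equation (2.3.5)
        if obstacle(xNext), g = B; else, g = 1; end   % reinforcement on this step
        V = V + gamma^k * g;                       % accumulate discounted cost
        x = xNext; k = k + 1;
    end
end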


More information

Artificial Intelligence Methods (G5BAIM) - Examination

Artificial Intelligence Methods (G5BAIM) - Examination Question 1 a) According to John Koza there are five stages when planning to solve a problem using a genetic program. What are they? Give a short description of each. (b) How could you cope with division

More information

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks

More information

Optimal Placement & sizing of Distributed Generator (DG)

Optimal Placement & sizing of Distributed Generator (DG) Chapter - 5 Optimal Placement & sizing of Distributed Generator (DG) - A Single Objective Approach CHAPTER - 5 Distributed Generation (DG) for Power Loss Minimization 5. Introduction Distributed generators

More information

On Optimal Power Flow

On Optimal Power Flow On Optimal Power Flow K. C. Sravanthi #1, Dr. M. S. Krishnarayalu #2 # Department of Electrical and Electronics Engineering V R Siddhartha Engineering College, Vijayawada, AP, India Abstract-Optimal Power

More information

OPTIMAL CAPACITOR PLACEMENT USING FUZZY LOGIC

OPTIMAL CAPACITOR PLACEMENT USING FUZZY LOGIC CHAPTER - 5 OPTIMAL CAPACITOR PLACEMENT USING FUZZY LOGIC 5.1 INTRODUCTION The power supplied from electrical distribution system is composed of both active and reactive components. Overhead lines, transformers

More information

Assessment of Available Transfer Capability Incorporating Probabilistic Distribution of Load Using Interval Arithmetic Method

Assessment of Available Transfer Capability Incorporating Probabilistic Distribution of Load Using Interval Arithmetic Method Assessment of Available Transfer Capability Incorporating Probabilistic Distribution of Load Using Interval Arithmetic Method Prabha Umapathy, Member, IACSIT, C.Venkataseshaiah and M.Senthil Arumugam Abstract

More information

CS221 Practice Midterm

CS221 Practice Midterm CS221 Practice Midterm Autumn 2012 1 ther Midterms The following pages are excerpts from similar classes midterms. The content is similar to what we ve been covering this quarter, so that it should be

More information

Distributed Optimization. Song Chong EE, KAIST

Distributed Optimization. Song Chong EE, KAIST Distributed Optimization Song Chong EE, KAIST songchong@kaist.edu Dynamic Programming for Path Planning A path-planning problem consists of a weighted directed graph with a set of n nodes N, directed links

More information

DISTRIBUTION SYSTEM OPTIMISATION

DISTRIBUTION SYSTEM OPTIMISATION Politecnico di Torino Dipartimento di Ingegneria Elettrica DISTRIBUTION SYSTEM OPTIMISATION Prof. Gianfranco Chicco Lecture at the Technical University Gh. Asachi, Iaşi, Romania 26 October 2010 Outline

More information

Lecture 9 Evolutionary Computation: Genetic algorithms

Lecture 9 Evolutionary Computation: Genetic algorithms Lecture 9 Evolutionary Computation: Genetic algorithms Introduction, or can evolution be intelligent? Simulation of natural evolution Genetic algorithms Case study: maintenance scheduling with genetic

More information

Optimal Operation of Large Power System by GA Method

Optimal Operation of Large Power System by GA Method Journal of Emerging Trends in Engineering and Applied Sciences (JETEAS) (1): 1-7 Scholarlink Research Institute Journals, 01 (ISSN: 11-7016) jeteas.scholarlinkresearch.org Journal of Emerging Trends in

More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

Regular paper. Particle Swarm Optimization Applied to the Economic Dispatch Problem

Regular paper. Particle Swarm Optimization Applied to the Economic Dispatch Problem Rafik Labdani Linda Slimani Tarek Bouktir Electrical Engineering Department, Oum El Bouaghi University, 04000 Algeria. rlabdani@yahoo.fr J. Electrical Systems 2-2 (2006): 95-102 Regular paper Particle

More information

Solving Numerical Optimization Problems by Simulating Particle-Wave Duality and Social Information Sharing

Solving Numerical Optimization Problems by Simulating Particle-Wave Duality and Social Information Sharing International Conference on Artificial Intelligence (IC-AI), Las Vegas, USA, 2002: 1163-1169 Solving Numerical Optimization Problems by Simulating Particle-Wave Duality and Social Information Sharing Xiao-Feng

More information

A.I.: Beyond Classical Search

A.I.: Beyond Classical Search A.I.: Beyond Classical Search Random Sampling Trivial Algorithms Generate a state randomly Random Walk Randomly pick a neighbor of the current state Both algorithms asymptotically complete. Overview Previously

More information

RL 3: Reinforcement Learning

RL 3: Reinforcement Learning RL 3: Reinforcement Learning Q-Learning Michael Herrmann University of Edinburgh, School of Informatics 20/01/2015 Last time: Multi-Armed Bandits (10 Points to remember) MAB applications do exist (e.g.

More information

Real Time Voltage Control using Genetic Algorithm

Real Time Voltage Control using Genetic Algorithm Real Time Voltage Control using Genetic Algorithm P. Thirusenthil kumaran, C. Kamalakannan Department of EEE, Rajalakshmi Engineering College, Chennai, India Abstract An algorithm for control action selection

More information

Research Article A Novel Differential Evolution Invasive Weed Optimization Algorithm for Solving Nonlinear Equations Systems

Research Article A Novel Differential Evolution Invasive Weed Optimization Algorithm for Solving Nonlinear Equations Systems Journal of Applied Mathematics Volume 2013, Article ID 757391, 18 pages http://dx.doi.org/10.1155/2013/757391 Research Article A Novel Differential Evolution Invasive Weed Optimization for Solving Nonlinear

More information

Evaluation of multi armed bandit algorithms and empirical algorithm

Evaluation of multi armed bandit algorithms and empirical algorithm Acta Technica 62, No. 2B/2017, 639 656 c 2017 Institute of Thermomechanics CAS, v.v.i. Evaluation of multi armed bandit algorithms and empirical algorithm Zhang Hong 2,3, Cao Xiushan 1, Pu Qiumei 1,4 Abstract.

More information

CS599 Lecture 1 Introduction To RL

CS599 Lecture 1 Introduction To RL CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming

More information

Module 6 : Preventive, Emergency and Restorative Control. Lecture 27 : Normal and Alert State in a Power System. Objectives

Module 6 : Preventive, Emergency and Restorative Control. Lecture 27 : Normal and Alert State in a Power System. Objectives Module 6 : Preventive, Emergency and Restorative Control Lecture 27 : Normal and Alert State in a Power System Objectives In this lecture you will learn the following Different states in a power system

More information

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING AN INTRODUCTION Prof. Dr. Ann Nowé Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING WHAT IS IT? What is it? Learning from interaction Learning about, from, and while

More information

CS 570: Machine Learning Seminar. Fall 2016

CS 570: Machine Learning Seminar. Fall 2016 CS 570: Machine Learning Seminar Fall 2016 Class Information Class web page: http://web.cecs.pdx.edu/~mm/mlseminar2016-2017/fall2016/ Class mailing list: cs570@cs.pdx.edu My office hours: T,Th, 2-3pm or

More information

Intuitionistic Fuzzy Estimation of the Ant Methodology

Intuitionistic Fuzzy Estimation of the Ant Methodology BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 9, No 2 Sofia 2009 Intuitionistic Fuzzy Estimation of the Ant Methodology S Fidanova, P Marinov Institute of Parallel Processing,

More information

CSC 4510 Machine Learning

CSC 4510 Machine Learning 10: Gene(c Algorithms CSC 4510 Machine Learning Dr. Mary Angela Papalaskari Department of CompuBng Sciences Villanova University Course website: www.csc.villanova.edu/~map/4510/ Slides of this presenta(on

More information

Reactive Power Contribution of Multiple STATCOM using Particle Swarm Optimization

Reactive Power Contribution of Multiple STATCOM using Particle Swarm Optimization Reactive Power Contribution of Multiple STATCOM using Particle Swarm Optimization S. Uma Mageswaran 1, Dr.N.O.Guna Sehar 2 1 Assistant Professor, Velammal Institute of Technology, Anna University, Chennai,

More information

Reactive Power Management using Firefly and Spiral Optimization under Static and Dynamic Loading Conditions

Reactive Power Management using Firefly and Spiral Optimization under Static and Dynamic Loading Conditions 1 Reactive Power Management using Firefly and Spiral Optimization under Static and Dynamic Loading Conditions Ripunjoy Phukan, ripun000@yahoo.co.in Abstract Power System planning encompasses the concept

More information

Local Search & Optimization

Local Search & Optimization Local Search & Optimization CE417: Introduction to Artificial Intelligence Sharif University of Technology Spring 2017 Soleymani Artificial Intelligence: A Modern Approach, 3 rd Edition, Chapter 4 Outline

More information

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels? Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity

More information

5. Simulated Annealing 5.1 Basic Concepts. Fall 2010 Instructor: Dr. Masoud Yaghini

5. Simulated Annealing 5.1 Basic Concepts. Fall 2010 Instructor: Dr. Masoud Yaghini 5. Simulated Annealing 5.1 Basic Concepts Fall 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Real Annealing and Simulated Annealing Metropolis Algorithm Template of SA A Simple Example References

More information

Optimal Placement and Sizing of Distributed Generation for Power Loss Reduction using Particle Swarm Optimization

Optimal Placement and Sizing of Distributed Generation for Power Loss Reduction using Particle Swarm Optimization Available online at www.sciencedirect.com Energy Procedia 34 (2013 ) 307 317 10th Eco-Energy and Materials Science and Engineering (EMSES2012) Optimal Placement and Sizing of Distributed Generation for

More information

Minimization of Energy Loss using Integrated Evolutionary Approaches

Minimization of Energy Loss using Integrated Evolutionary Approaches Minimization of Energy Loss using Integrated Evolutionary Approaches Attia A. El-Fergany, Member, IEEE, Mahdi El-Arini, Senior Member, IEEE Paper Number: 1569614661 Presentation's Outline Aim of this work,

More information

Fundamentals of Metaheuristics

Fundamentals of Metaheuristics Fundamentals of Metaheuristics Part I - Basic concepts and Single-State Methods A seminar for Neural Networks Simone Scardapane Academic year 2012-2013 ABOUT THIS SEMINAR The seminar is divided in three

More information

Genetic Algorithm for Solving the Economic Load Dispatch

Genetic Algorithm for Solving the Economic Load Dispatch International Journal of Electronic and Electrical Engineering. ISSN 0974-2174, Volume 7, Number 5 (2014), pp. 523-528 International Research Publication House http://www.irphouse.com Genetic Algorithm

More information

Course 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016

Course 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016 Course 16:198:520: Introduction To Artificial Intelligence Lecture 13 Decision Making Abdeslam Boularias Wednesday, December 7, 2016 1 / 45 Overview We consider probabilistic temporal models where the

More information

Short Course: Multiagent Systems. Multiagent Systems. Lecture 1: Basics Agents Environments. Reinforcement Learning. This course is about:

Short Course: Multiagent Systems. Multiagent Systems. Lecture 1: Basics Agents Environments. Reinforcement Learning. This course is about: Short Course: Multiagent Systems Lecture 1: Basics Agents Environments Reinforcement Learning Multiagent Systems This course is about: Agents: Sensing, reasoning, acting Multiagent Systems: Distributed

More information

Multi Objective Economic Load Dispatch problem using A-Loss Coefficients

Multi Objective Economic Load Dispatch problem using A-Loss Coefficients Volume 114 No. 8 2017, 143-153 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu Multi Objective Economic Load Dispatch problem using A-Loss Coefficients

More information

CHAPTER 3 FUZZIFIED PARTICLE SWARM OPTIMIZATION BASED DC- OPF OF INTERCONNECTED POWER SYSTEMS

CHAPTER 3 FUZZIFIED PARTICLE SWARM OPTIMIZATION BASED DC- OPF OF INTERCONNECTED POWER SYSTEMS 51 CHAPTER 3 FUZZIFIED PARTICLE SWARM OPTIMIZATION BASED DC- OPF OF INTERCONNECTED POWER SYSTEMS 3.1 INTRODUCTION Optimal Power Flow (OPF) is one of the most important operational functions of the modern

More information

Reinforcement Learning for Continuous. Action using Stochastic Gradient Ascent. Hajime KIMURA, Shigenobu KOBAYASHI JAPAN

Reinforcement Learning for Continuous. Action using Stochastic Gradient Ascent. Hajime KIMURA, Shigenobu KOBAYASHI JAPAN Reinforcement Learning for Continuous Action using Stochastic Gradient Ascent Hajime KIMURA, Shigenobu KOBAYASHI Tokyo Institute of Technology, 4259 Nagatsuda, Midori-ku Yokohama 226-852 JAPAN Abstract:

More information

Power System Analysis Prof. A. K. Sinha Department of Electrical Engineering Indian Institute of Technology, Kharagpur

Power System Analysis Prof. A. K. Sinha Department of Electrical Engineering Indian Institute of Technology, Kharagpur Power System Analysis Prof. A. K. Sinha Department of Electrical Engineering Indian Institute of Technology, Kharagpur Lecture - 9 Transmission Line Steady State Operation Welcome to lesson 9, in Power

More information

AQUIFER GEOMETRY AND STRUCTURAL CONTROLS ON GROUNDWATER POTENTIAL IN MOUNT ELGON AQUIFER, TRANS-NZOIA COUNTY, KENYA.

AQUIFER GEOMETRY AND STRUCTURAL CONTROLS ON GROUNDWATER POTENTIAL IN MOUNT ELGON AQUIFER, TRANS-NZOIA COUNTY, KENYA. AQUIFER GEOMETRY AND STRUCTURAL CONTROLS ON GROUNDWATER POTENTIAL IN MOUNT ELGON AQUIFER, TRANS-NZOIA COUNTY, KENYA. OGUT JULIUS ODIDA I56/79737/2012 A dissertation submitted to the Department of Geology

More information

Algorithms and Complexity theory

Algorithms and Complexity theory Algorithms and Complexity theory Thibaut Barthelemy Some slides kindly provided by Fabien Tricoire University of Vienna WS 2014 Outline 1 Algorithms Overview How to write an algorithm 2 Complexity theory

More information

Distributed vs Bulk Power in Distribution Systems Considering Distributed Generation

Distributed vs Bulk Power in Distribution Systems Considering Distributed Generation Distributed vs Bulk Power in Distribution Systems Considering Distributed Generation Abdullah A. Alghamdi 1 and Prof. Yusuf A. Al-Turki 2 1 Ministry Of Education, Jeddah, Saudi Arabia. 2 King Abdulaziz

More information

Economic Operation of Power Systems

Economic Operation of Power Systems Economic Operation of Power Systems Section I: Economic Operation Of Power System Economic Distribution of Loads between the Units of a Plant Generating Limits Economic Sharing of Loads between Different

More information

Citation for the original published paper (version of record):

Citation for the original published paper (version of record): http://www.diva-portal.org This is the published version of a paper published in SOP Transactions on Power Transmission and Smart Grid. Citation for the original published paper (version of record): Liu,

More information

Introduction to Reinforcement Learning. CMPT 882 Mar. 18

Introduction to Reinforcement Learning. CMPT 882 Mar. 18 Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and

More information

Reinforcement learning an introduction

Reinforcement learning an introduction Reinforcement learning an introduction Prof. Dr. Ann Nowé Computational Modeling Group AIlab ai.vub.ac.be November 2013 Reinforcement Learning What is it? Learning from interaction Learning about, from,

More information

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam: Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,

More information

Lin-Kernighan Heuristic. Simulated Annealing

Lin-Kernighan Heuristic. Simulated Annealing DM63 HEURISTICS FOR COMBINATORIAL OPTIMIZATION Lecture 6 Lin-Kernighan Heuristic. Simulated Annealing Marco Chiarandini Outline 1. Competition 2. Variable Depth Search 3. Simulated Annealing DM63 Heuristics

More information

Overview. Optimization. Easy optimization problems. Monte Carlo for Optimization. 1. Survey MC ideas for optimization: (a) Multistart

Overview. Optimization. Easy optimization problems. Monte Carlo for Optimization. 1. Survey MC ideas for optimization: (a) Multistart Monte Carlo for Optimization Overview 1 Survey MC ideas for optimization: (a) Multistart Art Owen, Lingyu Chen, Jorge Picazo (b) Stochastic approximation (c) Simulated annealing Stanford University Intel

More information

OPTIMIZED RESOURCE IN SATELLITE NETWORK BASED ON GENETIC ALGORITHM. Received June 2011; revised December 2011

OPTIMIZED RESOURCE IN SATELLITE NETWORK BASED ON GENETIC ALGORITHM. Received June 2011; revised December 2011 International Journal of Innovative Computing, Information and Control ICIC International c 2012 ISSN 1349-4198 Volume 8, Number 12, December 2012 pp. 8249 8256 OPTIMIZED RESOURCE IN SATELLITE NETWORK

More information

Available online at ScienceDirect. Procedia Computer Science 20 (2013 ) 90 95

Available online at  ScienceDirect. Procedia Computer Science 20 (2013 ) 90 95 Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 20 (2013 ) 90 95 Complex Adaptive Systems, Publication 3 Cihan H. Dagli, Editor in Chief Conference Organized by Missouri

More information

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 12: Reinforcement Learning Teacher: Gianni A. Di Caro REINFORCEMENT LEARNING Transition Model? State Action Reward model? Agent Goal: Maximize expected sum of future rewards 2 MDP PLANNING

More information

SOULTION TO CONSTRAINED ECONOMIC LOAD DISPATCH

SOULTION TO CONSTRAINED ECONOMIC LOAD DISPATCH SOULTION TO CONSTRAINED ECONOMIC LOAD DISPATCH SANDEEP BEHERA (109EE0257) Department of Electrical Engineering National Institute of Technology, Rourkela SOLUTION TO CONSTRAINED ECONOMIC LOAD DISPATCH

More information

Reinforcement Learning. Yishay Mansour Tel-Aviv University

Reinforcement Learning. Yishay Mansour Tel-Aviv University Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak

More information

Course notes for EE394V Restructured Electricity Markets: Locational Marginal Pricing

Course notes for EE394V Restructured Electricity Markets: Locational Marginal Pricing Course notes for EE394V Restructured Electricity Markets: Locational Marginal Pricing Ross Baldick Copyright c 2013 Ross Baldick www.ece.utexas.edu/ baldick/classes/394v/ee394v.html Title Page 1 of 132

More information

12. LOCAL SEARCH. gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria

12. LOCAL SEARCH. gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria 12. LOCAL SEARCH gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria Lecture slides by Kevin Wayne Copyright 2005 Pearson-Addison Wesley h ttp://www.cs.princeton.edu/~wayne/kleinberg-tardos

More information

Discrete evaluation and the particle swarm algorithm

Discrete evaluation and the particle swarm algorithm Volume 12 Discrete evaluation and the particle swarm algorithm Tim Hendtlass and Tom Rodgers Centre for Intelligent Systems and Complex Processes Swinburne University of Technology P. O. Box 218 Hawthorn

More information

STUDY OF PARTICLE SWARM FOR OPTIMAL POWER FLOW IN IEEE BENCHMARK SYSTEMS INCLUDING WIND POWER GENERATORS

STUDY OF PARTICLE SWARM FOR OPTIMAL POWER FLOW IN IEEE BENCHMARK SYSTEMS INCLUDING WIND POWER GENERATORS Southern Illinois University Carbondale OpenSIUC Theses Theses and Dissertations 12-1-2012 STUDY OF PARTICLE SWARM FOR OPTIMAL POWER FLOW IN IEEE BENCHMARK SYSTEMS INCLUDING WIND POWER GENERATORS Mohamed

More information

Application of Artificial Neural Network in Economic Generation Scheduling of Thermal Power Plants

Application of Artificial Neural Network in Economic Generation Scheduling of Thermal Power Plants Application of Artificial Neural Networ in Economic Generation Scheduling of Thermal ower lants Mohammad Mohatram Department of Electrical & Electronics Engineering Sanjay Kumar Department of Computer

More information

Lecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010

Lecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010 Lecture 25: Learning 4 Victor R. Lesser CMPSCI 683 Fall 2010 Final Exam Information Final EXAM on Th 12/16 at 4:00pm in Lederle Grad Res Ctr Rm A301 2 Hours but obviously you can leave early! Open Book

More information

Minimization of load shedding by sequential use of linear programming and particle swarm optimization

Minimization of load shedding by sequential use of linear programming and particle swarm optimization Turk J Elec Eng & Comp Sci, Vol.19, No.4, 2011, c TÜBİTAK doi:10.3906/elk-1003-31 Minimization of load shedding by sequential use of linear programming and particle swarm optimization Mehrdad TARAFDAR

More information

Zebo Peng Embedded Systems Laboratory IDA, Linköping University

Zebo Peng Embedded Systems Laboratory IDA, Linköping University TDTS 01 Lecture 8 Optimization Heuristics for Synthesis Zebo Peng Embedded Systems Laboratory IDA, Linköping University Lecture 8 Optimization problems Heuristic techniques Simulated annealing Genetic

More information

3D HP Protein Folding Problem using Ant Algorithm

3D HP Protein Folding Problem using Ant Algorithm 3D HP Protein Folding Problem using Ant Algorithm Fidanova S. Institute of Parallel Processing BAS 25A Acad. G. Bonchev Str., 1113 Sofia, Bulgaria Phone: +359 2 979 66 42 E-mail: stefka@parallel.bas.bg

More information

UNIT-I ECONOMIC OPERATION OF POWER SYSTEM-1

UNIT-I ECONOMIC OPERATION OF POWER SYSTEM-1 UNIT-I ECONOMIC OPERATION OF POWER SYSTEM-1 1.1 HEAT RATE CURVE: The heat rate characteristics obtained from the plot of the net heat rate in Btu/Wh or cal/wh versus power output in W is shown in fig.1

More information

Research Article Ant Colony Search Algorithm for Optimal Generators Startup during Power System Restoration

Research Article Ant Colony Search Algorithm for Optimal Generators Startup during Power System Restoration Mathematical Problems in Engineering Volume 2010, Article ID 906935, 11 pages doi:10.1155/2010/906935 Research Article Ant Colony Search Algorithm for Optimal Generators Startup during Power System Restoration

More information

Comparison of Loss Sensitivity Factor & Index Vector methods in Determining Optimal Capacitor Locations in Agricultural Distribution

Comparison of Loss Sensitivity Factor & Index Vector methods in Determining Optimal Capacitor Locations in Agricultural Distribution 6th NATIONAL POWER SYSTEMS CONFERENCE, 5th-7th DECEMBER, 200 26 Comparison of Loss Sensitivity Factor & Index Vector s in Determining Optimal Capacitor Locations in Agricultural Distribution K.V.S. Ramachandra

More information

Local Search & Optimization

Local Search & Optimization Local Search & Optimization CE417: Introduction to Artificial Intelligence Sharif University of Technology Spring 2018 Soleymani Artificial Intelligence: A Modern Approach, 3 rd Edition, Chapter 4 Some

More information

Using first-order logic, formalize the following knowledge:

Using first-order logic, formalize the following knowledge: Probabilistic Artificial Intelligence Final Exam Feb 2, 2016 Time limit: 120 minutes Number of pages: 19 Total points: 100 You can use the back of the pages if you run out of space. Collaboration on the

More information

Optimal Capacitor placement in Distribution Systems with Distributed Generators for Voltage Profile improvement by Particle Swarm Optimization

Optimal Capacitor placement in Distribution Systems with Distributed Generators for Voltage Profile improvement by Particle Swarm Optimization Optimal Capacitor placement in Distribution Systems with Distributed Generators for Voltage Profile improvement by Particle Swarm Optimization G. Balakrishna 1, Dr. Ch. Sai Babu 2 1 Associate Professor,

More information

Local Search (Greedy Descent): Maintain an assignment of a value to each variable. Repeat:

Local Search (Greedy Descent): Maintain an assignment of a value to each variable. Repeat: Local Search Local Search (Greedy Descent): Maintain an assignment of a value to each variable. Repeat: I I Select a variable to change Select a new value for that variable Until a satisfying assignment

More information

Part B" Ants (Natural and Artificial)! Langton s Vants" (Virtual Ants)! Vants! Example! Time Reversibility!

Part B Ants (Natural and Artificial)! Langton s Vants (Virtual Ants)! Vants! Example! Time Reversibility! Part B" Ants (Natural and Artificial)! Langton s Vants" (Virtual Ants)! 11/14/08! 1! 11/14/08! 2! Vants!! Square grid!! Squares can be black or white!! Vants can face N, S, E, W!! Behavioral rule:!! take

More information

Homework 2: MDPs and Search

Homework 2: MDPs and Search Graduate Artificial Intelligence 15-780 Homework 2: MDPs and Search Out on February 15 Due on February 29 Problem 1: MDPs [Felipe, 20pts] Figure 1: MDP for Problem 1. States are represented by circles

More information

Reinforcement Learning

Reinforcement Learning 1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision

More information