Approximate dynamic programming using model-free Bellman Residual Elimination


Citation: Bethke, B., and J. P. How. "Approximate dynamic programming using model-free Bellman Residual Elimination." American Control Conference (ACC). Print.
Publisher: Institute of Electrical and Electronics Engineers / American Automatic Control Council
Version: Final published version
Terms of use: Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use.

2010 American Control Conference
Marriott Waterfront, Baltimore, MD, USA
June 30-July 02, 2010
ThC07.2

Brett Bethke and Jonathan P. How

B. Bethke, Ph.D., bbethke@mit.edu. J. P. How is a Professor of Aeronautics and Astronautics, MIT, jhow@mit.edu.

Abstract

This paper presents a modification to the method of Bellman Residual Elimination (BRE) [1], [2] for approximate dynamic programming. While prior work on BRE has focused on learning an approximate policy for an underlying Markov Decision Process (MDP) when the state transition model of the MDP is known, this work proposes a model-free variant of BRE that does not require knowledge of the state transition model. Instead, state trajectories of the system, generated using simulation and/or observations of the real system in operation, are used to build stochastic approximations of the quantities needed to carry out the BRE algorithm. The resulting algorithm can be shown to converge to the policy produced by the nominal, model-based BRE algorithm in the limit of observing an infinite number of trajectories. To validate the performance of the approach, we compare model-based and model-free BRE against LSPI [3], a well-known approximate dynamic programming algorithm. Measuring performance in terms of both computational complexity and policy quality, we present results showing that BRE performs at least as well as, and sometimes significantly better than, LSPI on a standard benchmark problem.

I. INTRODUCTION

Markov Decision Processes (MDPs) are a powerful and general framework for addressing problems involving sequential decision making under uncertainty [4]. Unfortunately, since MDPs suffer from the well-known curse of dimensionality, the computational complexity of finding an exact solution is typically prohibitively large. Therefore, in most cases, approximate dynamic programming (ADP) techniques must be employed to compute approximate solutions that can be found in reasonable time [5].

Motivated by the success of kernel-based methods such as support vector machines [6] and Gaussian processes [7] in pattern classification and regression applications, researchers have begun applying these powerful techniques in the ADP domain. This work has led to algorithms such as GPTD [8], an approach which uses temporal differences to learn a Gaussian process representation of the cost-to-go function, and GPDP [9], which is an approximate value iteration scheme based on a similar Gaussian process cost-to-go representation. Another recently-developed approach, known as Bellman Residual Elimination (BRE) [1], [2], uses kernel-based regression to solve a system of Bellman equations over a small set of sample states.

Similar to the well-studied class of linear architectures [3], [10], kernel-based cost representations can be interpreted as mapping a state of the MDP into a set of features; however, unlike linear architectures, the effective feature vector of a kernel-based representation may be infinite-dimensional. This property gives kernel methods a great deal of flexibility and makes them particularly appropriate in approximate dynamic programming, where the structure of the cost function may not be well understood. The prior work on BRE [1], [2] has demonstrated that by taking advantage of this flexibility, a class of algorithms can be developed that enjoy several advantageous theoretical properties, including the ability to compute cost-to-go functions whose Bellman residuals are identically zero at the sampled states. This property, in turn, immediately implies that the BRE algorithms are guaranteed to converge to the optimal policy in the limit of sampling the entire state space. In addition to these theoretical properties, we have recently demonstrated encouraging results in applying BRE in order to compute approximate policies for large-scale, multi-agent UAV planning problems [11].
The BRE work to date has assumed that the underlying state transition model of the MDP is available, and has focused on developing a class of algorithms with the stated theoretical properties that can use this model information efficiently. Of course, in some circumstances, an exact model may not be known if the system in question is not well understood. To cope with these situations, this paper presents a model-free variation on the BRE algorithms developed to date. Instead of relying on knowledge of the state transition model, the new model-free approach uses state trajectory data, gathered either by simulation of the system or by observing the actual system in operation, to approximate the quantities needed to carry out BRE.

The paper begins by presenting background material on model-based BRE. It then shows how to modify the basic BRE approach when only state trajectory data, and not a full system model, is available. Several proofs demonstrate the correctness of the model-free BRE approach, establishing that the model-free algorithm converges to the same result as the model-based algorithm as more and more trajectory data is used. Finally, results of a comparison study between BRE and LSPI, a well-known approximate dynamic programming algorithm [3], show that both model-based and model-free BRE perform at least as well as, and sometimes significantly better than, LSPI on a standard benchmark problem.

II. BACKGROUND

A. Markov Decision Processes

This paper considers the general class of infinite horizon, discounted, finite state MDPs. The MDP is specified by $(S, A, P, g)$, where $S$ is the state space, $A$ is the action space, $P_{ij}(u)$ gives the transition probability from state $i$ to state $j$ under action $u$, and $g(i, u, j)$ gives the cost of moving from state $i$ to state $j$ under action $u$. To derive model-based BRE, we assume that the MDP model (i.e. the data $(S, A, P, g)$) is known. Future costs are discounted by a factor $0 < \alpha < 1$. A policy of the MDP is denoted by $\mu : S \to A$. Given the MDP specification, the problem is to minimize the cost-to-go function $J_\mu$ over the set of admissible policies $\Pi$:

$$\min_{\mu \in \Pi} J_\mu(i_0) = \min_{\mu \in \Pi} E\left[ \sum_{k=0}^{\infty} \alpha^k g(i_k, \mu(i_k)) \right].$$

For notational convenience, the cost and state transition functions for a fixed policy $\mu$ are defined as

$$g_i^\mu \triangleq \sum_{j \in S} P_{ij}(\mu(i))\, g(i, \mu(i), j), \tag{1}$$

$$P_{ij}^\mu \triangleq P_{ij}(\mu(i)),$$

respectively. The cost-to-go for a fixed policy $\mu$ satisfies the Bellman equation [4]

$$J_\mu(i) = g_i^\mu + \alpha \sum_{j \in S} P_{ij}^\mu J_\mu(j) \quad \forall i \in S, \tag{2}$$

which can also be expressed compactly as $J_\mu = T_\mu J_\mu$, where $T_\mu$ is the (fixed-policy) dynamic programming operator.

B. Bellman Residual Elimination

Bellman Residual Elimination is an approximate policy iteration technique that is closely related to the class of Bellman residual methods [12]-[14]. Bellman residual methods attempt to perform policy evaluation by minimizing an objective function of the form

$$\left( \sum_{i \in \tilde{S}} \left| \tilde{J}_\mu(i) - T_\mu \tilde{J}_\mu(i) \right|^2 \right)^{1/2}, \tag{3}$$

where $\tilde{J}_\mu$ is an approximation to the true cost function $J_\mu$ and $\tilde{S} \subset S$ is a set of representative sample states. BRE uses a flexible kernel-based cost approximation architecture to construct $\tilde{J}_\mu$ such that the objective function given by Eq. (3) is identically zero. We now present the basic derivation of the BRE approach; for more details, see [1], [2]. To begin, BRE assumes the following functional form of the cost function:

$$\tilde{J}_\mu(i) = \langle \Theta, \Phi(i) \rangle. \tag{4}$$

Here, $\langle \cdot, \cdot \rangle$ denotes the inner product in a Reproducing Kernel Hilbert Space (RKHS) [15] with the pre-specified kernel $k(i, i') = \langle \Phi(i), \Phi(i') \rangle$. $\Phi(i)$ is the feature mapping of the kernel $k(i, i')$, and $\Theta$ is a weighting element to be found by the BRE algorithm.
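The fixed-policy quantities in Eqs. (1)-(3) can be made concrete with a short sketch. The 3-state MDP below, its transition probabilities, and its costs are illustrative assumptions and do not come from the paper; the function names are likewise hypothetical. The sketch computes $g_i^\mu$, iterates the operator $T_\mu$ to its fixed point, and evaluates the residual objective of Eq. (3) for the exact cost and for the all-zero guess.

```python
# Illustrative 3-state MDP under a fixed policy mu; every number here is an assumption.
ALPHA = 0.9  # discount factor, 0 < alpha < 1

# P_MU[i][j] = P^mu_ij, the transition probability under the fixed policy.
P_MU = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]
# G[i][j] = g(i, mu(i), j), the cost of moving from i to j under the policy.
G = [
    [1.0, 2.0, 0.0],
    [0.0, 1.0, 3.0],
    [0.0, 0.0, 0.0],
]


def expected_costs(p_mu, g):
    """Eq. (1): g^mu_i = sum_j P^mu_ij * g(i, mu(i), j)."""
    return [sum(p * c for p, c in zip(p_row, g_row))
            for p_row, g_row in zip(p_mu, g)]


def bellman_operator(j, p_mu, g_mu, alpha):
    """Apply T_mu: (T_mu J)(i) = g^mu_i + alpha * sum_j P^mu_ij * J(j)."""
    return [g_mu[i] + alpha * sum(p * v for p, v in zip(p_mu[i], j))
            for i in range(len(j))]


def evaluate_policy(p_mu, g_mu, alpha, tol=1e-12):
    """Iterate J <- T_mu J until the update is below tol (fixed point of Eq. (2))."""
    j = [0.0] * len(g_mu)
    while True:
        j_next = bellman_operator(j, p_mu, g_mu, alpha)
        if max(abs(a - b) for a, b in zip(j, j_next)) < tol:
            return j_next
        j = j_next


def residual_objective(j, sample_states, p_mu, g_mu, alpha):
    """Eq. (3): root of the summed squared Bellman residuals over the sample states."""
    tj = bellman_operator(j, p_mu, g_mu, alpha)
    return sum((j[i] - tj[i]) ** 2 for i in sample_states) ** 0.5


g_mu = expected_costs(P_MU, G)
j_mu = evaluate_policy(P_MU, g_mu, ALPHA)
print("g_mu =", g_mu)
print("J_mu =", [round(v, 4) for v in j_mu])
print("objective at J_mu:", residual_objective(j_mu, [0, 1, 2], P_MU, g_mu, ALPHA))
print("objective at J = 0:", residual_objective([0.0] * 3, [0, 1, 2], P_MU, g_mu, ALPHA))
```

The objective is essentially zero at the true cost-to-go and strictly positive for the zero guess, which is the quantity that Bellman residual methods drive down and that BRE forces to zero on the sample states.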
The main insight behind BRE is the observation that the individual Bellman residuals $BR(i)$, given by

$$BR(i) \triangleq \tilde{J}_\mu(i) - T_\mu \tilde{J}_\mu(i), \tag{5}$$

can be simultaneously forced to zero by formulating a regression problem, which can then be solved using standard kernel-based regression schemes such as support vector regression [6] or Gaussian process regression [7]. To construct the regression problem, we substitute the functional form of the cost, given by Eq. (4), into Eq. (5):

$$BR(i) = \langle \Theta, \Phi(i) \rangle - \left( g_i^\mu + \alpha \sum_{j \in S} P_{ij}^\mu \langle \Theta, \Phi(j) \rangle \right).$$

By exploiting linearity of the inner product $\langle \cdot, \cdot \rangle$, we can express $BR(i)$ as

$$BR(i) = -g_i^\mu + \left\langle \Theta, \Phi(i) - \alpha \sum_{j \in S} P_{ij}^\mu \Phi(j) \right\rangle = -g_i^\mu + \langle \Theta, \Psi(i) \rangle,$$

where $\Psi(i)$ is defined as

$$\Psi(i) \triangleq \Phi(i) - \alpha \sum_{j \in S} P_{ij}^\mu \Phi(j). \tag{6}$$

By the preceding construction, every cost function $\tilde{J}_\mu$ is associated with a function

$$W_\mu(i) \triangleq \langle \Theta, \Psi(i) \rangle \tag{7}$$

which is closely related to the Bellman residuals through the relation $W_\mu(i) = BR(i) + g_i^\mu$. Therefore, the condition that the Bellman residuals are identically zero at the sample states $\tilde{S}$ is equivalent to finding a function $W_\mu$ satisfying

$$W_\mu(i) = g_i^\mu \quad \forall i \in \tilde{S}. \tag{8}$$

Eq. (7) implies that $W_\mu$ belongs to a RKHS whose kernel is given by

$$K(i, i') = \langle \Psi(i), \Psi(i') \rangle \tag{9}$$

$$= \left\langle \Phi(i) - \alpha \sum_{j \in S} P_{ij}^\mu \Phi(j),\; \Phi(i') - \alpha \sum_{j \in S} P_{i'j}^\mu \Phi(j) \right\rangle. \tag{10}$$

$K(i, i')$ is called the Bellman kernel associated with $k(i, i')$. Solving the regression problem given by Eq. (8), using the associated Bellman kernel and any kernel-based regression technique (such as support vector regression, Gaussian process regression, etc.), yields the weighting element $\Theta$. By the representer theorem [16, 4.2], $\Theta$ can be expressed as a linear combination of features $\Psi$ of the sample states:

$$\Theta = \sum_{i' \in \tilde{S}} \lambda_{i'} \Psi(i').$$

Therefore, the output of solving the kernel regression problem is compactly described by the coefficients $\{\lambda_{i'}\}_{i' \in \tilde{S}}$. Once these coefficients are known, Eqs. (4) and (6) can be used to compute the approximate cost function $\tilde{J}_\mu$:

$$\tilde{J}_\mu(i) = \langle \Theta, \Phi(i) \rangle = \sum_{i' \in \tilde{S}} \lambda_{i'} \langle \Psi(i'), \Phi(i) \rangle = \sum_{i' \in \tilde{S}} \lambda_{i'} \left( k(i', i) - \alpha \sum_{j \in S} P_{i'j}^\mu k(j, i) \right). \tag{11}$$

To summarize, the preceding construction has shown how to perform policy evaluation for a fixed policy $\mu$, given a base kernel $k(i, i')$ and a set of sample states $\tilde{S}$. The steps to perform approximate policy iteration are as follows:

1) Solve the regression problem [Eq. (8)] using the associated Bellman kernel $K(i, i')$ [Eq. (10)] and any kernel-based regression technique to find the coefficients $\{\lambda_{i'}\}_{i' \in \tilde{S}}$.
2) Use the coefficients found in step 1 to compute the cost function $\tilde{J}_\mu$ [Eq. (11)].
3) Perform a policy improvement step based on the cost function found in step 2, and return to step 1 to evaluate this new policy.

Under a technical nondegeneracy condition on the kernel $k(i, i')$ (which is satisfied by many commonly-used kernels including radial basis function kernels), it can be proven that the Bellman residuals of the resulting cost function $\tilde{J}_\mu$ are always identically zero at the states $\tilde{S}$, as desired [2]. As a result, it is straightforward to show that BRE is guaranteed to yield the exact cost $J_\mu$ in the limit of sampling the entire state space ($\tilde{S} = S$).

Graph Structure of the Associated Bellman Kernel

It is useful to examine the associated Bellman kernel $K(i, i')$ in some detail to understand its structure. Eq. (9) shows that $K(i, i')$ can be viewed as the inner product between the feature mappings $\Psi(i)$ and $\Psi(i')$ of the input states $i$ and $i'$, respectively. In turn, Eq. (6) shows that $\Psi(i)$ represents a new feature mapping that takes into account the local graph structure of the MDP, since it is a linear combination of $\Phi$ features of both the state $i$ and all of the successor states $j$ that can be reached in a single step from $i$ (these are all the states $j$ for which $P_{ij}^\mu$ is nonzero). Figure 1 is a graphical depiction of the associated Bellman kernel.
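The policy-evaluation construction above, including the way the Bellman kernel sums the base kernel over pairs of successor states, can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: the 4-state chain, its costs, the kernel width `SIGMA`, and the choice of sample states are all hypothetical, and the regression problem of Eq. (8) is solved by exact interpolation (solving $K\lambda = g^\mu$ on the sample states), which is only one of the kernel regression schemes the text allows.

```python
import math

ALPHA = 0.9   # discount factor
SIGMA = 1.0   # RBF kernel width (illustrative assumption)

# Hypothetical 4-state chain under a fixed "move left" policy:
# the move succeeds with probability 0.9 and slips the other way with 0.1.
P_MU = [
    [0.9, 0.1, 0.0, 0.0],
    [0.9, 0.0, 0.1, 0.0],
    [0.0, 0.9, 0.0, 0.1],
    [0.0, 0.0, 0.9, 0.1],
]
G_MU = [0.0, 1.0, 2.0, 3.0]   # expected one-stage costs g^mu_i (illustrative)
SAMPLES = [0, 2]              # sample states S~
STATES = range(len(P_MU))


def k(i, ip):
    """Base kernel k(i, i'): radial basis function on the state index."""
    return math.exp(-((i - ip) ** 2) / (2.0 * SIGMA ** 2))


def bellman_kernel(i, ip):
    """Eq. (10) expanded by linearity: K(i, i') = <Psi(i), Psi(i')>."""
    cross_i = sum(P_MU[i][j] * k(j, ip) for j in STATES)
    cross_ip = sum(P_MU[ip][j] * k(i, j) for j in STATES)
    pairs = sum(P_MU[i][j] * P_MU[ip][jp] * k(j, jp)
                for j in STATES for jp in STATES)
    return k(i, ip) - ALPHA * (cross_i + cross_ip) + ALPHA ** 2 * pairs


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def cost_to_go(i, lam):
    """Eq. (11): approximate cost J~_mu(i) from the coefficients lambda_{i'}."""
    return sum(coef * (k(ip, i) - ALPHA * sum(P_MU[ip][j] * k(j, i) for j in STATES))
               for coef, ip in zip(lam, SAMPLES))


def bellman_residual(i, lam):
    """Eq. (5): BR(i) = J~_mu(i) - (g^mu_i + alpha * sum_j P^mu_ij * J~_mu(j))."""
    backup = G_MU[i] + ALPHA * sum(P_MU[i][j] * cost_to_go(j, lam) for j in STATES)
    return cost_to_go(i, lam) - backup


# Step 1: regression problem W_mu(i) = g^mu_i on S~, solved here by exact interpolation.
gram = [[bellman_kernel(i, ip) for ip in SAMPLES] for i in SAMPLES]
lam = solve(gram, [G_MU[i] for i in SAMPLES])
# Step 2: the cost function follows from Eq. (11); its residuals vanish on S~.
residuals = [bellman_residual(i, lam) for i in SAMPLES]
print("lambda =", lam)
print("Bellman residuals at sample states:", residuals)
```

The printed residuals are zero up to floating-point round-off at the two sample states, while nothing is guaranteed at the unsampled states 1 and 3. The policy improvement step (step 3) is omitted from this sketch.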
Using the figure, we can interpret the associated Bellman kernel as measuring the total overlap or similarity between $i$, $i'$, and all immediate (one-stage) successor states of $i$ and $i'$. In this sense, the associated Bellman kernel automatically accounts for the local graph structure in the state space.

Fig. 1: Graph structure of the associated Bellman kernel $K(i, i')$ [Eq. (9)]. The one-stage successor states of $i$ and $i'$ are $\{j_1, \dots, j_m\}$ and $\{j'_1, \dots, j'_{m'}\}$, respectively. The associated Bellman kernel is formed by taking a linear combination of the base kernel $k(\cdot, \cdot)$ applied to all possible state pairings (shown by the dashed blue lines) where one state in the pair belongs to the set of $i$ and its descendants (this set is shown by the left shaded rectangle), and the other belongs to the set of $i'$ and its descendants (right shaded rectangle).

Computational Complexity

The computational complexity of model-based BRE is dominated by two factors: first, computing $K(i, i')$ over every pair of sample states $(i, i') \in \tilde{S} \times \tilde{S}$ (this information is often called the Gram matrix of the kernel), and second, solving the regression problem. As illustrated in Fig. 1, computing $K(i, i')$ involves enumerating each successor state of both $i$ and $i'$ and evaluating the base kernel $k(\cdot, \cdot)$ for each pair of successor states. If $\beta$ is the average branching factor of the MDP, each state will have $\beta$ successor states on average, so computing $K(i, i')$ for a single pair of states $(i, i')$ involves $O(\beta^2)$ operations. Therefore, if $n_s \triangleq |\tilde{S}|$ is the number of sample states, computing the full Gram matrix costs $O(\beta^2 n_s^2)$ operations. The cost of solving the regression problem clearly depends on the regression technique used, but typical methods such as Gaussian process regression involve solving a linear system in the dimension of the Gram matrix, which involves $O(n_s^3)$ operations. Thus, the total complexity of BRE is of order $O(\beta^2 n_s^2 + n_s^3)$.

III. MODEL-FREE BRE

The BRE algorithm presented in the previous section requires knowledge of the system dynamics model, encoded in the probabilities $P_{ij}^\mu$, to evaluate the policy $\mu$.
These probabilities are used in the calculation of the cost values $g_i^\mu$ [Eq. (1)], the associated Bellman kernel $K(i, i')$ [Eq. (10)], and the approximate cost-to-go $\tilde{J}_\mu(i)$ [Eq. (11)]. In particular, in order to solve the BRE regression problem given by Eq. (8), it is necessary to compute the values $g_i^\mu$ for $i \in \tilde{S}$, as well as the Gram matrix $\mathbf{K}$ of the associated Bellman kernel, defined by

$$\mathbf{K}_{ii'} = K(i, i'), \quad i, i' \in \tilde{S}.$$

By exploiting the model information, BRE is able to construct cost-to-go solutions $\tilde{J}_\mu(i)$ for which the Bellman residuals are exactly zero.

However, there may be situations in which it is not possible to use system model information directly. Clearly, one such situation is when an exact system model is unknown or unavailable. In this case, one typically assumes instead that a generative model of the system is available, which is a black box simulator that can be used to sample state trajectories of the system under a given policy. Alternatively, one may collect state trajectory data from the actual system in operation and use this data in lieu of an exact model.

To address these situations, we now develop a variant of the BRE algorithm that uses only trajectory data, obtained from simulations or from observing the real system, instead of using data from the true underlying system model $P_{ij}^\mu$. This variant of BRE therefore represents a true, model-free, reinforcement learning algorithm.

The key idea behind model-free BRE is to simulate a number of trajectories of length $n$, starting from each of the sample states in $\tilde{S}$, and use this information to build stochastic approximations to the cost values $g_i^\mu$, kernel Gram matrix $\mathbf{K}$, and cost-to-go $\tilde{J}_\mu(i)$. We use the notation $T_l^{iq}$ to denote the $l$-th state encountered in the $q$-th trajectory starting from state $i \in \tilde{S}$, where $l$ ranges from 0 to $n$ (the length of the trajectory), and $q$ ranges from 1 to $m$ (the number of trajectories starting from each state $i \in \tilde{S}$).

Using the trajectory data $T_l^{iq}$, stochastic approximations of each of the important quantities necessary to carry out BRE can be formed. First, examining Eq. (1), notice that the cost values $g_i^\mu$ can be expressed as

$$g_i^\mu = \sum_{j \in S} P_{ij}(\mu(i))\, g(i, \mu(i), j) \approx \frac{1}{m} \sum_{q=1}^{m} g\left(T_0^{iq}, \mu(T_0^{iq}), T_1^{iq}\right), \tag{12}$$

where the expectation over future states has been replaced by a Monte Carlo estimator based on the trajectory data. In the limit of sampling an infinite number of trajectories ($m \to \infty$), the approximation given by Eq. (12) converges to the true value of $g_i^\mu$.

A similar approximation can be constructed for the associated Bellman kernel by starting with Eq. (6):

$$\Psi(i) = \Phi(i) - \alpha \sum_{j \in S} P_{ij}^\mu \Phi(j) = \Phi(i) - \alpha E_j[\Phi(j)] \approx \Phi(i) - \frac{\alpha}{m} \sum_{q=1}^{m} \Phi(T_1^{iq}). \tag{13}$$

Substituting Eq. (13) into Eq. (10) gives

$$K(i, i') \approx \left\langle \Phi(i) - \frac{\alpha}{m} \sum_{q=1}^{m} \Phi(T_1^{iq}),\; \Phi(i') - \frac{\alpha}{m} \sum_{q'=1}^{m} \Phi(T_1^{i'q'}) \right\rangle.$$

Expanding this expression using linearity of the inner product gives

$$K(i, i') \approx \langle \Phi(i), \Phi(i') \rangle - \frac{\alpha}{m} \sum_{q=1}^{m} \left( \langle \Phi(T_1^{iq}), \Phi(i') \rangle + \langle \Phi(T_1^{i'q}), \Phi(i) \rangle \right) + \frac{\alpha^2}{m^2} \sum_{q=1}^{m} \sum_{q'=1}^{m} \langle \Phi(T_1^{iq}), \Phi(T_1^{i'q'}) \rangle.$$

Finally, substituting the definition of the kernel $k(i, i') = \langle \Phi(i), \Phi(i') \rangle$ gives

$$K(i, i') \approx k(i, i') - \frac{\alpha}{m} \sum_{q=1}^{m} \left( k(T_1^{iq}, i') + k(T_1^{i'q}, i) \right) + \frac{\alpha^2}{m^2} \sum_{q=1}^{m} \sum_{q'=1}^{m} k(T_1^{iq}, T_1^{i'q'}). \tag{14}$$

Again, in the limit of infinite sampling, Eq. (14) converges to $K(i, i')$. Finally, an approximation to $\tilde{J}_\mu(i)$ [Eq. (11)] is needed:

$$\tilde{J}_\mu(i) = \sum_{i' \in \tilde{S}} \lambda_{i'} \left( k(i', i) - \alpha \sum_{j \in S} P_{i'j}^\mu k(j, i) \right) = \sum_{i' \in \tilde{S}} \lambda_{i'} \left( k(i', i) - \alpha E_j[k(j, i)] \right) \approx \sum_{i' \in \tilde{S}} \lambda_{i'} \left( k(i', i) - \frac{\alpha}{m} \sum_{q=1}^{m} k(T_1^{i'q}, i) \right). \tag{15}$$

The procedure for carrying out model-free BRE can now be stated as follows:

1) Using the generative model of the MDP (or data collected from the real system), simulate $m$ trajectories of length $n$ starting from each of the sample states $i \in \tilde{S}$. Store this data in $T_l^{iq}$.
2) Solve the regression problem [Eq. (8)] using the stochastic approximations to $g_i^\mu$ and $K(i, i')$ (given by Eqs. (12) and (14), respectively), and any kernel-based regression technique to find the coefficients $\{\lambda_{i'}\}_{i' \in \tilde{S}}$.
3) Use the coefficients found in step 2 to compute a stochastic approximation to the cost function $\tilde{J}_\mu$, given by Eq. (15).
4) Perform a policy improvement step based on the cost function found in step 3, and return to step 1 to evaluate this new policy.

Computational Complexity

The overall complexity of running model-free BRE is still dominated by building the kernel Gram matrix and solving the regression problem, just as in model-based BRE. Furthermore, the complexity of solving the regression problem is the same for both methods: $O(n_s^3)$. However, in model-free BRE, computing an element of the associated Bellman kernel using Eq. (14) now requires $O(m^2)$ operations (where $m$ is the number of sampled trajectories), instead of $O(\beta^2)$ (where $\beta$ is the average branching factor of the MDP) as in the model-based case. In essence, in model-free BRE, the successor states for each sample state $i \in \tilde{S}$ (of which there are on average $\beta$) are approximated using data from $m$ simulated trajectories. Thus, computing the full Gram matrix requires $O(m^2 n_s^2)$ operations, and the total complexity of model-free BRE is of order $O(m^2 n_s^2 + n_s^3)$.
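The Monte Carlo estimators of Eqs. (12) and (14) can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the 4-state chain, the per-transition cost table `G`, the kernel width, the sample sizes, and the random seed are all hypothetical, and the true model `P_MU` is used only as a stand-in for the black-box simulator and to compute reference values for comparison.

```python
import math
import random

ALPHA = 0.9   # discount factor
SIGMA = 1.0   # RBF kernel width (illustrative assumption)
STATES = [0, 1, 2, 3]

# Hypothetical chain under a fixed policy; used here only as the simulator
# and to compute exact reference values.
P_MU = [
    [0.9, 0.1, 0.0, 0.0],
    [0.9, 0.0, 0.1, 0.0],
    [0.0, 0.9, 0.0, 0.1],
    [0.0, 0.0, 0.9, 0.1],
]
# G[i][j] = g(i, mu(i), j); here the cost is simply the index of the state reached.
G = [[float(j) for j in STATES] for _ in STATES]


def kernel(i, ip):
    """Base kernel k(i, i'): radial basis function on the state index."""
    return math.exp(-((i - ip) ** 2) / (2.0 * SIGMA ** 2))


def sample_successors(i, m, rng):
    """Draw T_1^{iq} for q = 1..m: the first state reached from i in each trajectory."""
    return rng.choices(STATES, weights=P_MU[i], k=m)


def cost_estimate(i, succ_i):
    """Eq. (12): Monte Carlo estimate of g^mu_i from observed transitions."""
    return sum(G[i][t] for t in succ_i) / len(succ_i)


def kernel_estimate(i, ip, succ_i, succ_ip):
    """Eq. (14): Monte Carlo estimate of K(i, i'); the double sum costs O(m^2)."""
    m = len(succ_i)  # assumes both states have m sampled successors
    cross = sum(kernel(t, ip) for t in succ_i) + sum(kernel(t, i) for t in succ_ip)
    pairs = sum(kernel(t, tp) for t in succ_i for tp in succ_ip)
    return kernel(i, ip) - ALPHA / m * cross + (ALPHA / m) ** 2 * pairs


def cost_exact(i):
    """Eq. (1), using the true model, for comparison."""
    return sum(P_MU[i][j] * G[i][j] for j in STATES)


def kernel_exact(i, ip):
    """Eq. (10), using the true model, for comparison."""
    cross = (sum(P_MU[i][j] * kernel(j, ip) for j in STATES)
             + sum(P_MU[ip][j] * kernel(j, i) for j in STATES))
    pairs = sum(P_MU[i][j] * P_MU[ip][jp] * kernel(j, jp)
                for j in STATES for jp in STATES)
    return kernel(i, ip) - ALPHA * cross + ALPHA ** 2 * pairs


rng = random.Random(0)
errors = {}
for m in (10, 100, 400):
    succ = {s: sample_successors(s, m, rng) for s in (0, 2)}
    k_err = abs(kernel_estimate(0, 2, succ[0], succ[2]) - kernel_exact(0, 2))
    g_err = abs(cost_estimate(2, succ[2]) - cost_exact(2))
    errors[m] = (k_err, g_err)
    print(f"m={m}: |K_hat - K| = {k_err:.4f}, |g_hat - g| = {g_err:.4f}")
```

With a finite number of trajectories the estimates carry sampling noise, so the errors need not decrease monotonically from one value of `m` to the next; the convergence is in the limit $m \to \infty$. When the sampled successors happen to match the true transition frequencies exactly, the estimators reproduce the model-based values.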

Correctness of Model-Free BRE

The following theorem and lemma establish two important properties of the model-free BRE algorithm presented in this section.

Theorem: In the limit of sampling an infinite number of trajectories (i.e. $m \to \infty$), the cost function $\tilde{J}_\mu(i)$ computed by model-free BRE is identical to the cost function computed by model-based BRE.

Proof: In order to approximate the quantities $g_i^\mu$, $K(i, i')$, and $\tilde{J}_\mu(i)$ in the absence of model information, model-free BRE uses the stochastic approximators given by Eqs. (12), (14), and (15), respectively. Examining these equations, note that in each, the state trajectory data $T_l^{iq}$ is used to form an empirical distribution

$$\hat{P}_{ij}^\mu = \frac{1}{m} \sum_{q=1}^{m} \delta(T_1^{iq}, j),$$

where $\delta(i, j)$ is the Kronecker delta function. This distribution is used as a substitute for the true distribution $P_{ij}^\mu$ in computing $g_i^\mu$, $K(i, i')$, and $\tilde{J}_\mu(i)$. Since by assumption the individual trajectories are independent, the random variables $\{T_1^{iq} \mid q = 1, \dots, m\}$ are independent and identically distributed. Therefore, the law of large numbers states that the empirical distribution converges to the true distribution in the limit of an infinite number of samples:

$$\lim_{m \to \infty} \hat{P}_{ij}^\mu = P_{ij}^\mu.$$

Therefore, as $m \to \infty$, Eqs. (12), (14), and (15) converge to the true values $g_i^\mu$, $K(i, i')$, and $\tilde{J}_\mu(i)$. In particular, the cost function $\tilde{J}_\mu(i)$ computed by model-free BRE converges to the cost function computed by model-based BRE as $m \to \infty$, as desired.

Using results shown in [1], [2], which prove that model-based BRE computes a cost function $\tilde{J}_\mu(i)$ whose Bellman residuals are exactly zero at the sample states $\tilde{S}$, we immediately have the following lemma:

Lemma: In the limit $m \to \infty$, model-free BRE computes a cost function $\tilde{J}_\mu(i)$ whose Bellman residuals are exactly zero at the sample states $\tilde{S}$.

Proof: The theorem showed that in the limit $m \to \infty$, model-free BRE yields the same cost function $\tilde{J}_\mu(i)$ as model-based BRE. Therefore, applying the results from [1], [2], it immediately follows that the Bellman residuals of $\tilde{J}_\mu(i)$ are zero at the sample states.

IV. RESULTS

A series of tests was carried out to compare the performance of the BRE algorithms presented in this paper to LSPI [3], a well-known approximate dynamic programming algorithm. LSPI uses a linear combination of basis functions to represent approximate Q-values of the MDP, and learns a weighting of these basis functions using simulated trajectory data. In learning the weights, LSPI can take advantage of system model information if it is available, or can be run in a purely model-free setting if not. Thus, our tests compared the performance of four different algorithms: model-based BRE, model-free BRE, model-based LSPI, and model-free LSPI. The LSPI implementation used for these tests is the one provided by the authors in [3].

The benchmark problem used was the chain-walk problem [3], [17], which has a one-dimensional state space (we used a total of 50 states in these tests) and two possible actions ("move left" and "move right") in each state. In order to make the comparisons between BRE and LSPI as similar as possible, the same cost representation was used in both algorithms. More precisely, in LSPI, five radial basis functions (with standard deviation $\sigma = 2$), with centers at $x = 1, 11, 21, 31, 41$, were used; while in BRE, the same radial basis kernel (with the same $\sigma$) was used and the sample states were taken as $\tilde{S} = \{1, 11, 21, 31, 41\}$. This ensures that neither algorithm gains an unfair advantage by being provided with a better set of basis functions.

The algorithms were compared on two different performance metrics: quality of the approximate policy produced by each algorithm (expressed as a percentage of states in which the approximate policy matches the optimal policy), and running time of the algorithm (measured in the total number of elementary operations required, such as additions and multiplications).
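The policy-quality metric described above is simple enough to state as code. The sketch below is an assumption-laden illustration: the function name `policy_quality` and the two example policies on a 50-state chain are hypothetical and are not the policies from the experiments.

```python
def policy_quality(approx_policy, optimal_policy):
    """Percentage of states in which the approximate policy matches the optimal one."""
    if len(approx_policy) != len(optimal_policy):
        raise ValueError("policies must be defined over the same states")
    matches = sum(1 for a, b in zip(approx_policy, optimal_policy) if a == b)
    return 100.0 * matches / len(optimal_policy)


# Hypothetical policies on a 50-state chain ("L" = move left, "R" = move right).
optimal = ["R"] * 25 + ["L"] * 25
approx = ["R"] * 30 + ["L"] * 20   # disagrees with `optimal` in states 25..29

quality = policy_quality(approx, optimal)
print(f"policy quality: {quality:.1f}% of states match the optimal policy")
```

With only two actions per state, a policy chosen uniformly at random would score about 50% on this metric, which is the baseline the results below are read against.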
The policies produced by LSPI and the model-free variant of BRE depend on the amount of simulated data provided to the algorithm, so each of these algorithms was run with different amounts of simulated data to investigate how the amount of data impacts the quality of the resulting policy. Furthermore, since the simulated data is randomly generated according to the generative model, each algorithm was run multiple times to examine the effect of this random simulation noise on the algorithm performance.

The results of the comparison tests are shown in Fig. 2. Model-based BRE, shown as the filled blue diamond, finds the optimal policy. Furthermore, its running time is faster than any other algorithm in the test by at least one order of magnitude. In addition, it is the only algorithm that is free from simulation noise, and it therefore consistently finds the optimal policy every time it is run, unlike the other algorithms, which may yield different results over different runs. Thus, if a system model is available, model-based BRE has a clear advantage over the other algorithms in the chain-walk problem.

Fig. 2: Comparison of BRE vs. LSPI for the chain-walk problem. Numbers after the algorithm names denote the amount of simulation data used to train the algorithm. Note that model-based BRE does not require simulation data.

LSPI also found the optimal policy in a number of the tests, confirming similar results that were reported in [3]. However, the amount of simulation data required to consistently find a near-optimal policy was large (between 5,000 and 10,000 samples), leading to long run times. Indeed, all of the LSPI variants that consistently found policies within 10% of optimal were between two and four orders of magnitude slower than model-based BRE. As the amount of simulation data is decreased in an attempt to reduce the solution time of LSPI, the quality of the produced policies becomes lower on average and also more inconsistent across runs. For simulated data sets of less than about 1,000 samples, the policy quality approaches (and sometimes drops below) 50% of optimal, indicating performance equivalent to (or worse than) simply guessing one of the two actions randomly. Allowing LSPI access to the true system model does appear to improve the performance of the algorithm slightly; in Fig. 2, model-based LSPI generally yields a higher quality policy than model-free LSPI for a given data set size (although there are several exceptions to this). However, the results indicate that model-based BRE is significantly more efficient than model-based LSPI at exploiting knowledge of the system model to reduce the computation time needed to find a good policy.

Finally, examining the results for model-free BRE, the figure shows that this algorithm yields consistently good policies, even when a small amount of simulation data is used. As the amount of simulation data is reduced, the variability of the quality of policies produced increases and the average quality decreases, as would be expected. However, both of these effects are significantly smaller than in the LSPI case. Indeed, the worst policy produced by model-free BRE is still within 2% of optimal, while using only 50 simulated samples. In contrast, LSPI exhibited a much greater variability and much lower average policy quality when the amount of simulation data was reduced.

V. CONCLUSION

This paper has presented a model-free variant of the BRE approach to approximate dynamic programming. Instead of relying on knowledge of the system model, the model-free variant uses trajectory simulations or data from the real system to build stochastic approximations to the cost values $g_i^\mu$ and kernel Gram matrix $\mathbf{K}$ needed to carry out BRE. It is straightforward to show that in the limit of carrying out an infinite number of simulations, this approach reduces to the model-based BRE approach, and thus enjoys the same theoretical properties as that approach (including the important fact that the Bellman residuals are identically zero at the sample states $\tilde{S}$).

Furthermore, experimental comparison of BRE against another well-known approach, LSPI, indicates that while both approaches find near-optimal policies, both model-based and model-free BRE appear to have several advantages over LSPI in the benchmark problem we tested. In particular, model-based BRE appears to be able to efficiently exploit knowledge of the system model to find the policy significantly faster than LSPI. In addition, model-based BRE is free of simulation noise, eliminating the problem of inconsistent results across different runs of the algorithm. Even when model information is not available, model-free BRE still finds near-optimal policies more consistently and more quickly than LSPI.

ACKNOWLEDGMENTS

Research supported by the Boeing Company, Phantom Works, Seattle; AFOSR grant FA; and the Hertz Foundation.

REFERENCES

[1] B. Bethke and J. How, "Approximate dynamic programming using Bellman residual elimination and Gaussian process regression," in Proceedings of the American Control Conference, St. Louis, MO.
[2] B. Bethke, J. How, and A. Ozdaglar, "Approximate dynamic programming using support vector regression," in Proceedings of the 2008 IEEE Conference on Decision and Control, Cancun, Mexico.
[3] M. Lagoudakis and R. Parr, "Least-squares policy iteration," Journal of Machine Learning Research, vol. 4.
[4] D. Bertsekas, Dynamic Programming and Optimal Control. Belmont, MA: Athena Scientific.
[5] D. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996.
[6] A. Smola and B. Schölkopf, "A tutorial on support vector regression," Statistics and Computing, vol. 14.
[7] C. Rasmussen and C. Williams, Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA.
[8] Y. Engel, "Algorithms and representations for reinforcement learning," Ph.D. dissertation, Hebrew University.
[9] M. Deisenroth, J. Peters, and C. Rasmussen, "Approximate dynamic programming with Gaussian processes," in Proceedings of the American Control Conference.
[10] J. Si, A. Barto, W. Powell, and D. Wunsch, Learning and Approximate Dynamic Programming. NY: IEEE Press.
[11] B. Bethke, J. How, and J. Vian, "Multi-UAV persistent surveillance with communication constraints and health management," in Proceedings of the AIAA Guidance, Navigation and Control Conference, Chicago, IL, August.
[12] P. Schweitzer and A. Seidmann, "Generalized polynomial approximation in Markovian decision processes," Journal of Mathematical Analysis and Applications, vol. 110, 1985.
[13] L. C. Baird, "Residual algorithms: Reinforcement learning with function approximation," in ICML, 1995.
[14] R. Munos and C. Szepesvári, "Finite-time bounds for fitted value iteration," Journal of Machine Learning Research.
[15] N. Aronszajn, "Theory of reproducing kernels," Transactions of the American Mathematical Society, vol. 68, 1950.
[16] B. Schölkopf and A. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA.
[17] D. Koller and R. Parr, "Policy iteration for factored MDPs," in UAI, C. Boutilier and M. Goldszmidt, Eds. Morgan Kaufmann, 2000.
