Dynamic Iterations for the Solution of Ordinary Differential Equations on Multicore Processors

Size: px

Start display at page:

Download "Dynamic Iterations for the Solution of Ordinary Differential Equations on Multicore Processors"

Derrick Harrell
5 years ago
Views:

1 Dynamic Ieraions for he Soluion of Ordinary Differenial Equaions on Mulicore Processors Yanan Yu Compuer Science Deparmen Florida Sae Universiy Tallahassee FL 336, USA Ashok Srinivasan Compuer Science Deparmen Florida Sae Universiy Tallahassee FL 336, USA Absrac In he pas few years, here has been a rend of providing increased compuing power hrough greaer number of cores on a chip, raher han hrough higher clock speeds. In order o exploi he available compuing power, applicaions need o be parallelized efficienly. We consider he soluion of Ordinary Differenial Equaions (ODE) on mulicore processors. Convenional parallelizaion sraegies disribue he sae space amongs he processors, and are efficien only when he sae space of he ODE sysem is large. However, users of a deskop sysem wih mulicore processors may wish o solve small ODE sysems. Dynamic ieraions, parallelized along he ime domain, appear promising for such applicaions. However, hey have been of limied usefulness because of heir slow convergence. They also have a high memory requiremen when he number of ime seps is large. We propose a hybrid mehod ha combines convenional sequenial ODE solvers wih dynamic ieraions. We show ha i has beer convergence and also requires less memory. Empirical resuls show a facor of wo o four improvemen in performance over an equivalen convenional solver on a single node. The significance of his paper lies in proposing a new mehod ha can enable small ODE sysems, possibly wih long ime spans, o be solved faser on mulicore processors.. Inroducion Parallel processing is now widely available due o he populariy of mulicore processors. A deskop, for example, may have wo quad-core processors. I may also be equipped wih a GPU having - cores. The Cell processor has eigh cores called SPEs, which are mean o handle he compuaional workload. Two of hese can be combined o form a 6-core shared memory machine. The Sony PS3 uses he same chip and makes six SPE cores available for use. Thus parallel processing is no longer resriced o he HPC communiy. Non-HPC users ofen have a need o solve smaller problems han he HPC communiy does. However, he speed of soluion can sill be imporan. For example, reducing he ime o solve an ODE from a minue o around en seconds is a noiceable difference in soluion ime. Applicaions will need o efficienly exploi he available parallelism, in order o ge improved performance. This can be difficul for small problems, because of he finer granulariy. We will consider he specific problem of solving an Iniial Value Problem involving an ordinary differenial equaion. We provide below a simple descripion of he problem and he convenional approach o solving i. (Readers wih a background in ODEs can skip he nex wo paragraphs.) A firs order ODE is given in he form du/d = f(, u), where u, which may be a vecor, is he sae of a physical sysem, and is ofen ime. Higher order ODEs can be expressed as a firs order ODE sysem using a sandard ransformaion [3]. The iniial sae of he sysem a some ime, u( ), is provided in an Iniial Value Problem. The problem is o deermine he value of u a subsequen values of ime, up o some ime span T. An ODE solver will ieraively deermine hese values by saring wih he known value of u a, and use u( i ) o find u( i+ ). Convenional parallelizaion of his procedure disribues he componens of u amongs differen processors, and all processors updae he values of u assigned o hem, a each ime sep. In order o perform his updae, each processor compues he componens of f for he componens of u assigned o i. They will need some componens of u assigned o oher processors, and hus communicaion is required. In a disribued memory compuing environmen, he communicaion cos is a leas he order of a microsecond using a fas nework such as Infiniband. In he shared memory programming paradigm on symmeric muliprocessor sysems, he hread synchronizaion overhead is he order of a microsecond. Therefore, he compuaion involved on a processor should be large enough ha he parallel overhead does no become a boleneck. When he sae space is small, he communicaion overhead dominaes he compuaion, and parallelizaion is no feasible. Thus any improvemen

2 over he sequenial performance is of much benefi in such siuaions. For small ODE sysems, he soluion ime is significan when he number of ime seps is large. Parallelizaion of he ime domain appears promising hen. Dynamic ieraions are a class of mehods ha are ofen suiable for ime parallelizaion. Here, we sar ou wih an iniial guess for he enire ime domain, and hen ieraively improve i unil i converges. Picard ieraions, which pre-dae modern ODE solvers, are easy o parallelize along he ime domain each processor updaes a differen ime inerval independenly, followed by a parallel prefix compuaion o correc he updae. This mehod is explained furher in. Dynamic ieraions were well sudied in he 98s and 99s. However, hey were no suiable for mos realisic problems due o heir slow convergence. They also have a high memory requiremen, because he soluion a all ime seps in he previous ieraion is needed for he updae in he nex ieraion. In his paper, we propose a hybrid scheme ha combines convenional sequenial ODE solvers wih dynamic ieraions. We show ha he convergence is faser wih his scheme. This scheme also guaranees progress in each ieraion, a leas one ime inerval converges. Thus, in he wors case, he oal compuaion ime is no worse han ha of a sequenial solver, apar from a small parallelizaion overhead. Of course, we hope o improve on a sequenial solver in a ypical case. We show laer ha we obain significan improvemen in performance on some sample ODE sysems, compared wih an equivalen convenional solver. We also compare i wih Picard ieraions. The hybrid mehod converges abou wice as fas as Picard ieraions. The ouline of he res of his paper is as follows. In we explain dynamic ieraions, and Picard ieraions in paricular. We hen describe he new approach in 3 and discuss is correcness and convergence. We demonsrae is effeciveness in 4. We discuss oher ime parallelizaion echniques in 5. We summarize our conclusions and presen fuure work in 6.. Dynamic Ieraions The modern developmen of dynamic ieraions for he soluion of Iniial Value Problems sared wih he work on Waveform Relaxaion for he soluion of large differenialalgebraic equaions in VLSI circui simulaions [7]. Many varians of he original mehod were hen developed in he 98s and 99s. A unified definiion of his class of mehods is presened in [3]. I is also shown here ha Picard ieraions, which pre-dae waveform relaxaion by almos a cenury, can be considered a special case of his class of mehods. We use he noaion of [3] in he descripion below. Consider an iniial value problem, u = f(, u), u() = u, () where u R n, f : R R n R n, and u := du/d. Dynamic Ieraions solve () recursively using he following equaion. g(, u m ) = f(, u m ), () = u. () The recursion is normally sared using he iniial guess u () = u. If g is defined such ha g(u, u) =, hen i can be shown ha if he ieraion converges, hen i converges o he exac soluion, provided () is solved exacly [3]. In pracice, numerical mehods are used o solve (), and he approximaions used affec he resuls, as shown in 4. Differen mehods differ in heir choice of g, which influences he convergence behavior and also he ease of parallelizaion. Picard ieraions choose g(y, z) =. We discuss oher common choices of g in 5. Dynamic ieraions converge under fairly weak condiions [3], and so can be applied in mos pracical siuaions. The Picard mehod can also be parallelized efficienly, as explained below. Picard ieraions recursively solve () as follows. Iniially, u is sored a discree ime poins. In he (m+)h ieraion, f(, u m ) is compued a hose ime poins. Then f is numerically inegraed using some quadraure mehod. This leads o he following expression for deermining, when he inegraion is exac. u() m+ = u + f(s, u m (s)) ds. (3) The above equaion can be rewrien as follows. i+ = um+ i + i+ i f(s, u m (s)) ds. (4) Here, i is a sequence of ime poins. There may be several ime poins a which u m are known, in he ime inerval [ i, i+ ]. The inegral in (4) is evaluaed using a quadraure rule. The above mehod is parallelized easily. Each processor i, i < P is responsible for deermining he values of in he inerval ( i, i+ ], using he values of u m. Each of hem firs evaluaes he inegrals in (4) for heir ime poins using a quadraure mehod, for which hey will need some boundary values of u m available on neighboring processors. Then, a parallel prefix compuaion on he inegral for i s is performed o compue he cumulaive inegrals from o i. Afer his, each processor can independenly compue he values of for which hey are responsible. If he ime inerval ( i, i+ ) conains sufficien ime poins a which u are evaluaed, hen he compuaion cos will be much larger han he prefix cos, and hus he parallelizaion will be efficien. While he parallel overhead is ofen small, he efficiency of parallelizaion of any dynamic ieraion should be compared agains an equivalen convenional ODE solver (ha is, having he same error order as he quadraure), raher han agains he dynamic ieraion on one processor, in order o ge an idea of he effeciveness of dynamic ieraions.

3 Under his crieria, noe ha if he compuaions converge in N ieraions, hen he efficiency canno be greaer han /N. For example, N = 5 will lead o relaive efficiency less han %. Thus, dynamic ieraions are suiable when convenional mehods give poor parallel efficiency. Dynamic ieraions have generally no found widespread use because hey were slow o converge, making hem uncompeiive wih convenional solvers. For example, fig. shows four ieraions being needed for convergence (jus from a visual perspecive) for a simple equaion. However, we believe ha wih he populariy of mulicore processors, i is ime o re-evaluae he usefulness of dynamic ieraions. As menioned earlier, we now have a siuaion where users can solve small problems in parallel. Convenional solvers are no effecive for his, and so any gain in speedup is a benefi, even if he efficiency is low. For example, an efficiency of % on eigh cores gives a speedup of.6 over a convenional solver. 3. A Hybrid Scheme We will presen our mehod as an inuiive improvemen o he Picard mehod o deal wih one of he laer s shorcoming, and hen we will analyze is convergence properies. Apar from slow convergence, anoher problem wih he Picard mehod, of somewha less imporance, is is large memory requiremen. In order o simulae n ime seps, he value of u m needs o be sored for he n ime poins. Usually one divides he oal ime span ino smaller windows, and wais for convergence of he enire window before moving o he nex window. However, he number of ime seps in each window can sill be high. This can affec performance hrough increased cache misses. In he proposed mehod, only he resuls a cerain ime poins i are sored. The ime seps are usually small for sabiliy and accuracy reasons. However, because he resul of he exac soluion a each ime sep is no always required, we need no o sore he resuls a all inermediae ime seps, whereas Picard ieraions need he inermediae values inside each inerval [ i, i+ ]. Of course, he granulariy of he i s in he hybrid mehod should be a leas as fine as he resoluion desired by he user. We firs noe ha he exac soluion of eqn. () is given by he following. u i+ = u i + i+ i f(s, u(s)) ds. (5) If we compare wih he Picard recurrence, eqn. (4), we see ha Picard ieraions replace u wih u m in he inegral, because ha is he laes approximaion o u available o i. In our mehod, we have u m available only a he poins i. In our mehod, processor i will solve he exac ODE eqn. () wih iniial condiion û (i) ( i ) = u m i, up o ime i+. The ha on u indicaes ha his is a soluion o he exac ODE wih a possibly wrong iniial condiion, and he subscrip (i) indicaes ha his problem is solved on processor i. Noe ha û (i) ( i+ ) = û (i) ( i ) + i+ i f(s, û (i) (s)) ds. Thus we ge he following expression. û (i) ( i+ ) u m i = i+ i f(s, û (i) (s)) ds, (6) where we have used he iniial condiion û (i) ( i ) = u m i. We replace he inegral in he Picard ieraions recurrence, eqn. (4) wih he inegral above, and his inegral, in urn, can be replaced by he lef hand side of he expression above o give he following recurrence for he hybrid mehod. i+ := um+ i + û (i) ( i+ ) u m i. (7) The parallelizaion is similar o ha of he Picard mehod, and is shown in Algorihm 3.. The û (i) ( i+ ) are compued by solving an ODE, independenly on each processor. Then a parallel prefix compuaion is performed as in he Picard mehod and he new approximaion i+ is compued. An inuiive way of hinking abou his mehod is as follows. If u m i were accurae, hen û i ( i+ ) would be he exac soluion for ime i+. However, we observe an error in he value of i u m i beween wo successive ieraions a ime i. We add his as a correcion facor o û i ( i+ ). Algorihm 3.: HYBRID-METHOD(u i = u, Number of processors P) repea for each processor i {..P } û m (i) ( i+) Solve ODE do (wih iniial condiion u do m i ) i û m (i) ( i+) u m i G i Parallel prefix on i i+ G i + u unil Convergence We nex show ha if he ieraions converge, hen hey converge o he exac soluion, provided he ODE is solved exacly in each ieraion. Of course, he ODE is acually solved numerically, wih approximaion errors. We evaluae he effec of hese in 4. Assume ha he ieraions converge. A convergence, i = u m i, for i [, n]. From eqn. (7), we ge i+ = û (i) ( i+ ). Processor sars from he exac iniial condiion, and hus = û ( ) is exac. Bu = u m due o convergence, and hus u m is exac. Thus process sared from he exac soluion in deermining û ( ) and hus = û ( ) is exac. We can proceed in his manner, using an inducive argumen o show ha is exac.

4 We now consider he issue of how fas he mehod converges. We firs explain why we inuiively expec i o have beer convergence han Picard ieraions. We hen prove ha i converges in a finie number of seps, if he ODE soluion is exac. We will demonsrae in 4. ha i has good convergence properies in pracice. We saw above ha he difference beween Picard ieraions and he hybrid mehod is in how he erm i+ i f(s, u(s)) ds is esimaed, wihou knowing he rue soluion u. Picard ieraions perform a quadraure using he approximaion u m. The hybrid mehod sars wih a poin on u m, bu uses he exac differenial equaion. Thus, we can expec is soluion o be beer han ha of Picard ieraions, which use he old approximaion hroughou he ime domain. We also show ha he hybrid mehod always progresses. Tha is, in each sep, a leas one addiional ime inerval converges. Noe ha processor sars wih he exac iniial condiion and hus converges afer he firs ieraion. In he second ieraion, he iniial condiion for processor is exac, and so is soluion u is exac. Using an inducive argumen, u i i is exac. Thus he mehod converges in n ieraions if here are n ime inervals. Of course, his would no give any speedup over an equivalen sequenial solver, and we would hope o do beer in pracice. However, i does provide a guaranee ha in he wors case, he performance will no be worse han an equivalen convenional sequenial solver, if he parallelizaion overheads are small. Noe ha he memory consumpion of his solver on each processor is small roughly he same as a sequenial solver. I is independen of he number of ime seps in a ime inerval, unlike dynamic ieraions. We consider hree varians of his mehod. (i) We solve for he enire ime domain by dividing he ime span T ino T/P inervals on P processors. (ii) We use windows of a cerain widh W, wih he number of windows being much larger han P. The processors wai for convergence of he firs P windows (one window per processor) before proceeding wih he nex se of P windows. This ype of windowing has been shown by ohers o be more effecive han solving for he enire ime domain in dynamic ieraions [6]. (iii) We noice ha afer he firs few processors have converged, hey don perform useful work by repeaing he calculaions. So, hese processors will be used for compuing new windows. We call his mehod sliding windows. Noe ha a processor canno be considered as having converged unil i has locally converged and all he processors handling smaller values of ime have also converged. 4. Empirical Evaluaion 4.. Experimenal Seup The experimenal ess were performed on an Inel Xeon cluser. Each node has wo.33 GHz Quad-core CPUs wih 8 GB shared memory. OpenMPI. is used for iner-process communicaions. Boh he sequenial and parallel code are compiled wih opimizaion level O3. All he speedup and relevan resuls are obained on 8 cores on a single node. We use an assembly iming rouine, which accesses he ime-samp couner hrough he rdsc insrucion, for our iming resuls. The resoluion of he imer is.5µs. I is sufficienly good for he iming purpose in his paper, where he iming resuls are a leas he order of a few microseconds. The resuls are presened by solving hree ODE sysems. They are: (i) ODE: u = ω u, wih ω =, u() =, u () = for [, 6.4]; (ii) ODE Airy equaion: u = k u, wih k =, u( 6) =, u () = for [ 6, ]; (iii) ODE3 Duffing equaion: u + δu + (βu 3 + ω u) = γ cos(ω+φ), wih γ =.3, φ =, δ =., β =.5, ω =, u( 6) =, u () = for [ 6, ]. The Airy equaion is commonly seen in physics and asronomy. The Duffing equaion describes he moion of a damped oscillaor. In his paper, a single sep 4h order Runge-Kua Mehod is implemened as he ODE solver used in he hybrid mehod. We use he Simpson s rule o evaluae he quadraure in he Picard mehod. The accuracy of he Simpson s rule is also order 4. All compuaions were in double precision. The exac soluions o each of he hree ODEs are shown in fig., 3, 4. The exac soluion o he firs ODE can be expressed in he simple symbolic form, y = sin(ω). However, here are no closed-form soluions o he second and hird ODEs in he real domain. In order o ge good approximaions o he exac soluions, we used he following procedure. We solved he ODEs using a sequenial ODE solver. We hen solved hem again wih a ime sep size ha was an order of magniude smaller, and deermined he larges difference in values a any given ime poin. We repeaed his wih smaller ime seps, unil his larges difference was he order of. This soluion was hen used as an approximaion o he exac soluion. In 4., we chose he ime sep size such ha he global error is of he order of 6. Therefore, he accuracy in he approximaed exac soluion is sufficien for he comparisons in his paper. In he resuls below, we use he erms olerance and error. The laer refers o he absolue difference beween he exac soluion and he compued soluion. The olerance is a parameer o he ODE solver. When using dynamic ieraions, he soluion is considered o have converged if he absolue difference beween soluions a successive ieraions is smaller han his value. When used wih a sequenial ODE

5 y.8.6 y Figure. Convergence of Picard ieraion for he soluion of u = u, u = Figure 3. ODE: y = k y 4.5 y y Figure. ODE: y = ω y solver, he olerance has he convenional meaning. The error may be larger han he olerance. The speedup resuls compare he ime of dynamic ieraions wih he ime for a sequenial 4h order Runge- Kua Mehod. Speedup is close o linear when we compare wih a sequenial implemenaion of dynamic ieraions. However, in order o realisically evaluae he usefulness of dynamic ieraions, we need o compare hem agains a convenional solver. The oal compuaional work of a sequenial solver is less han ha of dynamic ieraions, and so he speedup resuls are lower han if compared wih a sequenial implemenaion of dynamic ieraions Figure 4. ODE3: y + δy + (βy 3 + ω ) = γ cos(ω + φ) 4.. Experimenal Resuls We firs explain our choice of ime sep size. If he ime sep size is smaller han necessary, hen he granulariy of he compuaion is coarser han necessary, and hus he communicaion overheads are arificially made small, relaive o he compuaional effor in is parallel implemenaion. In order o avoid his, we firs evaluae he accuracy of he sequenial ODE solver for differen ime sep sizes, and choose one ha gives good accuracy. For sabiliy, a sep size of.5 is sufficien. However, a much smaller sep size is usually required for accuracy. Fig. 5 shows he comparison of he errors from ode45 funcion in Malab (wih parameers of absolue error = 6 and relaive error = 6 ) and dlsode funcion in ODEPACK. We need a sep size of 7 for he

6 4 5 4 ODE ODE ODE3 Errors 6 8 RK4 sepsize=e 7 RK4 sepsize=e 6 ode45 (Malab) ODEPACK Figure 5. Error versus ime for differen values of sepsize h using he sequenial solver for ODE. Errors are compared wih ha of ode45(malab) and dlsode(odepack). Speedup 4 3 Wihou Windowing Wih Windowing Sliding Window ODE ODE ODE3 Figure 6. Speedup resul of he hree ODEs using differen varians of he hybrid mehod. The window size is 5 ime seps and he olerance is 6. sequenial ODE solver o have he same order of accuracy as ode45 in Malab when solving ODE. Similar experimens were performed for ODE and ODE3 and heir sep sizes were also deermined as 7. As menioned in 3, he hybrid mehod can have hree differen varians wihou windowing, wih windowing, and wih sliding windows. We wish o compare he effeciveness of each of hese varians. Fig. 6 shows he speedup resuls using he hree varians on he ODEs. The window size is 5 ime seps for boh he windowing mehods. We can see ha sliding windows yields a beer speedup han fixed No. of Ieraions per Window 3 Tol= Tol= Window Size (no. of ime seps) Figure 7. Average number of ieraions using differen window sizes and olerances wih sliding windows. windows. The non-window mehod does no perform well for such small numbers of processors, because i requires more han eigh ieraions, on eigh cores, o converge for he given ime span. Given hese observaions, we will only use resuls from he sliding windows mehod for comparisons wih he sequenial solver as well as wih Picard ieraions, in he following discussions. The hybrid mehod wih sliding windows requires hree MPI funcion calls per ieraion. MPI Scan is used for he parallel prefix summaion, MPI Allreduce is called in esing for convergence, and MPI Isend/MPI Recv is used o ransfer he laes updaed value beween cerain processors. Across 8 cores on he same node, MPI Scan and MPI Isend/MPI Recv ake less han µs and MPI Allreduce akes less han µs. The compuaion ime per sep for he hree sample ODEs are.µs,.µs and.44µs. For speedup resuls shown in he following discussions, he window size is a leas 3 ime seps. In ha case, he granulariy of he compuaion is of he order of a hundred microseconds, and he overall communicaion overhead a mos he order of %. We now discuss he choice of window size. As menioned before, he window size affecs he granulariy of he parallelism. For exac ODE solvers, smaller granulariy indicaes relaively larger communicaion overhead. Thus he speedup can be expeced o decrease wih decrease in window size. In he hybrid mehod, he window size can also affec he number of ieraions as well as he numerical accuracy of he soluion. We experimened wih hree differen window sizes, 5 ime seps, 4 ime seps and 3 ime seps. Fig. 7 compares he average number of ieraions per window for he soluions o converge. As we can see, if we use he same order of error olerance for all he window sizes, he

7 6 5 ODE ODE ODE3 window size= 3 window size= 4 window size= 5 Speedup 4 3 Tol= Tol= Errors 3 5 Window Size (no. of ime seps) Figure 8. Speedup resuls using differen window sizes and olerances wih sliding windows Figure. Errors using differen window sizes in he Picard mehod wih olerance equal o o solve for ODE3. Errors 4 6 window size= 5, olerance= 6 window size= 4, olerance= 6 window size= 3, olerance= window size= 3, olerance= 6 window size= 4, olerance= 6 4 Figure 9. Accuracy of he soluion using differen window sizes and olerances for ODE3. (The black solid line almos overlaps wih he red dashed line.) smaller he window size is, he faser he mehod converges. Fig. 8 compares he speedup when differen window sizes and olerance are used o solve he hree ODEs. However here is a rade-off wih he accuracy of he soluion. Fig. 9 shows how he accuracy is affeced by differen window sizes and local error olerances. The error is abou wo orders larger when he window size is decreased by one order of magniude, when he olerance is 6. By comparing he effec of differen window sizes and olerances on boh speedup and accuracy, we chose he window size of 3 ime seps. Similar conclusions were also drawn for ODE and ODE. The window size also affecs he accuracy and he convergence rae of Picard ieraions. Fig. shows he error versus he window size for ODE3. The sep size used for he quadraure inegraion is he same as ha of he ODE solver for our modified mehod. The solid line uses he same window size of 3 and olerance of. As we can see he accuracy is abou order lower han ha of our mehod. Fig. shows he comparison of he number of ieraions of he Picard mehod o he hybrid mehod for he equivalen parameers of window size, sepsize and olerance. The y-axis is he raio of he number of ieraions required by Picard ieraions versus he number of ieraions required by he hybrid mehod. The convergence rae is slower in Picard ieraions in all he hree cases. However, noe ha his improvemen in convergence rae does no direcly ranslae o a proporional improvemen in speedup, because he Runge-Kua mehod evaluaes he ODE a 5 poins for each ime sep, while he Picard ieraion, using Simpson s rule, only calls he funcion f in equaion () once for each ime sep. 5. Relaed Work We discussed dynamic ieraions using he Picard mehod in. Oher mehods can be derived by a differen choice of g in equaion (). Two imporan cases are he Waveform Jacobi mehod and he Waveform Gauss-Seidel mehod. The definiions of g for hese are provided in [3]. The Waveform Jacobi recurrence is like ha of equaion (3), excep ha in he evaluaion of he j h componen of f, he j h componen of u m is replaced by he j h componen of. This decomposes he ODE wih n variables ino n ODEs where one variable is updaed in each

8 Average number of ieraions per window (Picard ieraion/hybrid mehod) 3 Tol= Tol=e 6 Window size (no. of ime seps) ODE ODE ODE3 Tol= Figure. Raio of he average number of ieraions. Picard mehod vs. he hybrid mehod compared for he same window size and olerance. ODE. In he evaluaion of f, he value of u m (s) is used for all oher variables. The Waveform Gauss-Seidel recurrence is like ha of equaion (3), excep ha in he evaluaion of he j h componen of f, all componens of u m wih index less han or equal o j are replaced by he corresponding componens of. This decomposes he ODE wih n variables ino n ODEs where one variable is updaed in each ODE. The ODE for he firs componen is solved as in Waveform Jacobi. The ODE for he second componen uses he values of he firs componen of (s) evaluaed before, and he values of u m (s) for he hird and higher indexed componens, in each evaluaion of f. This process coninues, wih he ODE for each componen using he values of (s) for componens wih smaller indices, and u m (s) for componens wih larger indices, in each evaluaion of f. Noe ha, despie he use of an ODE solver insead of quadraure by hese mehods, hey are quie differen from he hybrid-scheme. Their basic goal is o decouple he compuaion of each componen of he sysem. They have high memory requiremen as ha of Picard ieraions, because u m needs o be sored for all ime seps. Each ieraion of Waveform Jacobi can be parallelized by having he ODE for each componen solved on a differen processor. This is advanageous for large sysems, especially when differen ime seps can be used for differen componens. However, he communicaion cos can be high, because he componens of u a all ime poins will need o be sen o oher processors. Parallelizing i along he ime domain is difficul in general. Some special cases can, however, be parallelized in ime [3]. A wavefron approach is used o parallelize Waveform Gauss-Seidel [3]. The idea behind his is basically pipelining. Noe ha once he firs ime sep of he firs componen of has been compleed, he firs ime sep of he second componen can be sared, while he second ime sep of he firs componen is simulaneously sared. If we need o perform m ieraions wih d variables for n ime seps, hen he m n d evaluaions of f are performed in parallel in d+m n ime, if d n. This mehod appears o parallelize well. However, noe ha a sequenial ODE solver can solve he same problem wih n evaluaions of f. Thus, even wih parallelizaion, his mehod is no faser han an equivalen convenional solver. The above processes of spliing he ODE ino subcomponens can be generalized by keeping blocks of variables ha are srongly coupled ogeher. The blocks may also overlap [4]. When () is solved exacly, he soluion converges superlinearly [3], [] on a finie ime inerval. However, numerical mehods have o be used o solve i in pracice. Theoreical resuls on convergence for discree version of dynamic ieraions are presened in [], [3]. Esimaes of suiable window sizes for good convergence are provided in [5]. In realisic siuaions, his class of mehods has ofen been found o converge slowly, which is a major limiaion of hese mehods in pracice. Differen sraegies have been proposed o speed up he convergence of dynamic ieraions. Coarse ime sep size for he earlier ieraions, and finer ime seps for laer ieraions, is proposed in []. A wo-sep-size scheme for solving Picard ieraions is proposed in [6]. A fine sep size is used for a fixed number of ieraions in all he sub-sysems o smooh he ieraions. A coarse sep size is hen used o solve he residue of he soluion resriced o he coarse ime-mesh. Muligrid echniques have also been used for acceleraing convergence [8], [7]. Reduced order modeling echniques have been combined wih dynamic ieraions, in order o help propagae he change in one componen of he sysem faser o he res of he sysem in []. A differen approach o ime parallelizaion uses a coarsegrained model o guess he soluion, followed by correcion. The Parareal mehod can be considered an example of his [], [9]. Dynamic daa-driven ime parallelizaion [4], [5], [8] uses resuls from prior, relaed, runs o consruc a coarse-grained model. Convenional parallelizaion of ODEs is hrough disribuion of he sae space, as menioned earlier. The difficuly wih small sysems is ha he compuaion ime per ime sep is small. For example, each sep in our compuaion akes he order of a enh of a microsecond using he. However, his mehod may be useful for oher reasons, such as when widely differen ime seps can be used for differen componens. Bu he order in which variables are solved has o be chosen carefully, o preven he variable wih small imes seps from becoming a boleneck o variables ha depend on i [3].

9 Runge-Kua mehod. This will be even smaller when he sae space is disribued across he cores. On he oher hand, MPI communicaion overhead is of he order of microseconds. One could also use hreads on mulicore processors. However, he hread synchronizaion overhead is hen significan, as menioned earlier. In fac, a parallel OpenMP implemenaion ook almos an order of magniude more ime han a sequenial ODE solver, for he hree ODEs considered in his paper. 6. Conclusions We have shown ha hybrid dynamic ieraions yield significanly beer performance han an equivalen sequenial mehod. This is imporan because convenional parallelizaion is no feasible for small ODEs. The loss in efficiency in he hybrid scheme is primarily due o he number of ieraions required for convergence. The oher parallelizaion overheads are small. Thus, hey provide subsanial benefi when effecive, and are no much worse han he sequenial compuaion when less effecive. We have considered hree varians of his mehod: (i) solving for he enire inerval, (ii) using windows, and (iii) using sliding windows. The former wo have been sudied by ohers for dynamic ieraions (see [5], [], [6] for windowing). We demonsraed hrough he speedup resuls ha he parallel efficiency of he hird is beer. We also showed ha he convergence behavior of he hybrid mehod is beer han Picard ieraions. Furhermore, our mehod has smaller memory requiremens han dynamic ieraions. This can lead o beer cache performance. This paper has demonsraed he poenial of hybrid dynamic ieraions in providing a scalable soluion for small ODE sysems on modern compuing plaforms. We propose he following fuure work o furher esablish his echnique. We are currenly working on compleing he following simple exensions of his paper. (i) We are evaluaing more ODEs. (ii) We are implemening his approach on GPUs (including double precision) and on he Cell. The hybrid mehod s smaller memory requiremen is especially useful on hese plaforms because of he small local sore on he Cell and he small se of regisers and shared memory on he GPU. (iii) We plan o replace he Runge-Kua mehod, which requires muliple evaluaions of f a each ime sep, by a muli-sep mehod, which requires only one funcion evaluaion a each sep. Our plans for he fuure are as follows. (i) We wish o use an adapive ODE solver insead of he fixed ime-sep Runge-Kua mehod ha we have currenly used. This may require load balancing, when he ime seps sizes on differen processors are differen. (ii) Our error olerance condiion is currenly very simple, and jus based on he absolue difference beween differen ieraions. Convenional solvers use error esimaes o deermine he olerance. We need o develop heoreical resuls o enable us o esimae he error, based on he ime seps size, olerance, and window widh. We are also planning a Malab inerface so ha he Malab ode45 solver can be replaced by our solver when a GPU is available. Acknowledgmens This work was funded by NSF gran # DMS 668. We also hank Xin Yuan a Florida Sae Universiy for permiing use of his Linux cluser. A.S. expresses his graiude o Sri Sahya Sai Baba for his help and inspiraion. References [] L. Baffico, S. Bernard, Y. Maday, G. Turinici, and G. Zerah. Parallel-in-ime molecular-dynamics simulaions. Physical Review E (Saisical, Nonlinear, and Sof Maer Physics), 66: ,. [] M. Bjorhus. A noe on he convergence of discreized dynamic ieraions. BIT Numerical Mahemaics, 35:9 96, 995. [3] C. W. Gear. Waveform mehods for space and ime parallelism. Journal of Compuaional and Applied Mahemaics, 38:37 47, 99. [4] R. Jelsch and B. Pohl. Waveform relaxaion wih overlapping spliings. SIAM Journal on Scienific Compuing, 6:4 49, 995. [5] B. Leimkuhler. Esimaing waveform relaxaion convergence. SIAM Journal on Scienific Compuing, 4:87 889, 993. [6] B. Leimkuhler. Timesep acceleraion of waveform relaxaion. SIAM Journal on Numerical Analysis, 35:3 5, 998. [7] E. Lelarasmee, A. E. Ruehli, and A. L. Sangiovanni- Vincenelli. The waveform relaxaion mehod for imedomain analysis of large scale inegraed circuis. IEEE Transacions on Compuer-aided Design of Inegraed Circuis and Sysems, :3 45, 98. [8] Ch. Lubich and A. Osermann. Muli-grid dynamic ieraions for parabolic equaions. BIT Numerical Mahemaics, 7:6 34, 987. [9] Y. Maday and G. Turinici. Parallel in ime algorihms for quanum conrol: Parareal ime discreizaion scheme. Inernaional Journal of Quanum Chemisry, 93:3 38, 3. [] O. Nevanlinna. Remarks on Picard Lindelöf ieraion. BIT Numerical Mahemaics, 9:38 346, 989. [] M. Rahinam and L. R. Pezold. Dynamic ieraion using reduced order models: A mehod for simulaion of large scale modular sysems. SIAM Journal on Numerical Analysis, 4: ,. [] J. Sand and K. Burrage. A Jacobi waveform relaxaion mehod for ODEs. SIAM Journal on Scienific Compuing, :534 55, 998.

10 [3] L. F. Shampine. Numerical Soluion of Ordinary Differenial Equaions. Chapman & Halll, 994. [4] A. Srinivasan and N. Chandra. Laency olerance hrough parallelizaion of ime in scienific applicaions. Parallel Compuing, 3: , 5. [5] A. Srinivasan, Y. Yu, and N. Chandra. Applicaion of reduced order modeling o ime parallelizaion. In Proceedings of HiPC 5, Lecure Noes in Compuer Science, volume 3769, pages 6 7. Springer-Verlag, 5. [6] J. Sun and H. Grosollen. Fas ime-domain simulaion by waveform relaxaion mehods. IEEE Transacions on Circuis and Sysems - I: Fundamenal Theory and Applicaions, 44:66 666, 997. [7] S. Vandewalle and R. Piessens. On dynamic ieraion mehods for solving ime-periodic differenial equaions. SIAM Journal on Numerical Analysis, 3:86 33, 993. [8] Y. Yu, A. Srinivasan, and N. Chandra. Scalable imeparallelizaion of molecular dynamics simulaions in nano mechanics. In Proceedings of he 35 h Inernaional Conference on Parallel Processing (ICPP), pages 9 6. IEEE, 6.

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Simulaion-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Week Descripion Reading Maerial 2 Compuer Simulaion of Dynamic Models Finie Difference, coninuous saes, discree ime Simple Mehods Euler Trapezoid