Chapter 4: Efficient Collective Communication


Fay Preston
5 years ago


1. Chapter 4: Efficient Collective Communication
Collective communication: comm amongst a collection of nodes (not just sender & recver).
One-to-all (bcast), all-to-one (reduc), all-to-all, scatter/gather, etc.
Optimization can have different goals (default is last):
  1. Minimize a particular node's time in collective communication
  2. Exploit known synchronization to reduce total time in alg
  3. Min total time in alg, assuming roughly synced nodes
As always, keep our eye on two (possibly conflicting) aims:
  1. Min time by using as many links as possible (eg., send more messages)
  2. Avoid contention (min # of mesgs, or use particular comm pattern)
Avoiding contention requires knowledge of the interconnects amongst the collection
We assume cut-through (packet) routing, so comm time: $t_c = t_s + m t_w$
  $t_s$: mesg startup time
  $t_w$: per-word transfer time (inverse of bandwidth in words)
  $m$: mesg size in words

2. One-To-All Broadcast and All-To-One Reduction
bcast: Source node has m-length buffer needed by all other nodes
reduction: All nodes contribute m-length buffers which are combined (eg., sum) to a final m-length buffer left on the destination node
These ops are duals: can run bcast in reverse to create reduction
Now discuss efficient implementation on various canonical interconnects:
  Note that row/col of 2-D mesh is ring or linear array
Perfectly parallel algorithm uses all links all the time
  Not usually possible (eg bcast: only 1 proc busy in 1st step)
  Will want to send as many as possible, w/o contention
When we get answer to all nodes (in collection) in minimal possible time, we call this the minimum spanning tree

3. Basics of Broadcast
For bcast where hardware can handle 1 send at a time, the min spanning tree is given by recursive doubling:
  All nodes possessing mesg send to a node that doesn't
  At each stage, number of senders doubles
  Takes $\log_2(p)$ steps, wt $2^i$ senders in step $i$ ($0 \le i < \log_2(p)$)
  In final step, $p/2$ links active, even ring has this many links!
  Have ordering choices: to avoid contention, send to furthest node 1st
For Reduction, just reverse the steps and add combine operation!
Cost: $\log_2(p)(t_s + t_w m)$ [reduction, with per-word combine time $t_o$: $\log_2(p)(t_s + (t_w + t_o) m)$]
[Figures: contention-free bcast; contention-free reduction]

4. Recursive Doubling for Mesh Interconnects
On 2-D mesh, utilize ring-based rec doubling within row (col) and then have all col (row) do same in parallel.
Row/col of square p-node mesh is a $\sqrt{p}$-node linear array
Cost: $2\log_2(\sqrt{p})(t_s + t_w m) = (\log_2(\sqrt{p}) + \log_2(\sqrt{p}))(t_s + t_w m) = \log_2(p)(t_s + t_w m)$
  using $\log_b(x) + \log_b(y) = \log_b(xy)$
Use 2-D alg to avoid contention, not to change # of steps!
3-step alg for 3-D mesh
Simply reverse direction for reduction
Intro to Parallel Computing, Fig 4.5, pg 153
[Figure: contention-free 2-D mesh bcast]
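The cost formulas above can be checked numerically. The following is a minimal sketch of the cost model only, assuming p is a power of two; the function names (`ilog2`, `ptp_cost`, `bcast_cost`, `mesh_bcast_cost`) are hypothetical helpers and not part of the slides.

```c
#include <assert.h>

/* Integer log2 for p = 2^d (hypothetical helper; the model assumes a power of two). */
int ilog2(int p)
{
    int d = 0;
    assert(p > 0 && (p & (p - 1)) == 0);
    while ((1 << d) < p)
        d++;
    return d;
}

/* Point-to-point time under cut-through routing: t_c = t_s + m*t_w. */
double ptp_cost(double ts, double tw, int m)
{
    return ts + m * tw;
}

/* Recursive-doubling one-to-all broadcast: log2(p) message steps. */
double bcast_cost(int p, double ts, double tw, int m)
{
    return ilog2(p) * ptp_cost(ts, tw, m);
}

/* Square 2-D mesh with side*side nodes: a row phase and a column phase,
   each a recursive-doubling broadcast over `side` nodes. */
double mesh_bcast_cost(int side, double ts, double tw, int m)
{
    return 2 * ilog2(side) * ptp_cost(ts, tw, m);
}
```

For example, with 16 nodes, t_s = 10, t_w = 1 and m = 100, both the ring-style and the 4x4 mesh variant give 4 steps of cost 110 each, which matches the claim that the 2-D algorithm changes contention but not the step count.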


5. Recursive Doubling on Hypercube Interconnects
Hypercube wt $2^d$ nodes is d-dimensional mesh with 2 nodes in each dim:
  Apply ptp comm along each of the d links, alg done
Unlike mesh/ring, all orderings work, since hyp has links in every dim; however, other than ordering & hops, no better than cheaper mesh [$\log_2(p)(t_s + m t_w)$]!
Hyp bcast works for indirect connection using balanced binary tree
Intro to Parallel Computing, Fig 4.6, pg 154
Intro to Parallel Computing, Fig 4.6
[Figures: contention-free hypercube bcast; contention-free binary tree bcast]

6. General One-to-All Broadcast Algorithm
    void oneall(int d, int Iam, int src, int n, void *x)
    {
       viam = Iam ^ src;               /* virtual rank: src becomes node 0 */
       mask = (1<<d) - 1;
       for (i=d-1; i >= 0; i--)
       {
          abit = (1<<i);
          mask ^= abit;
          if ((viam & mask) == 0)      /* am I active in this step? */
          {
             if ((viam & abit) == 0)
             {
                vdest = viam ^ abit;
                send(vdest^src, n, x);
             }
             else
             {
                vsrc = viam ^ abit;
                recv(vsrc^src, n, x);
             }
          }
       }
    }
mask indicates who is allowed in: init to all 1s, so no one but the source
Each iter removes most sig bit restriction, adding new nodes (recursive doubling!)
Of active nodes, 1/2 sending, others recving:
  active bit = 0, send
  active bit = 1, recv
Can get send-to-nearest by reversing i loop (OK hyp, not mesh)
(x ^ y) ^ y = x
One alg (adapted) gives contention-free 1-to-all bcast on ring, mesh, hyp
Since viam allows easy conversion between node-0 src/dest, use node-0 in future for clarity!

7. Recursive Halving All-to-One Reduction (dest=0)
Easy to build recursive halving all-to-one reduction by reversing bcast:
    void AllOneReduce(int d, int Iam, int m, TYPE *X)
    {
       mask = 0;                       // all nodes start
       for (i=0; i < d; i++)
       {
          abit = (1<<i);
          if ((Iam & mask) == 0)
          {
             if ((Iam & abit) != 0)
             {
                dest = Iam^abit;
                send(dest, m, X);
             }
             else
             {
                src = Iam^abit;
                recv(src, m, buff);
                for (k=0; k < m; k++)
                   X[k] += buff[k];
             }
          }
          mask |= abit;
       }
    }
Can use ^= or |= or += for mask update by abit
Since bcast pattern reversed:
  Odd nodes are the first senders
  Recvers stay in alg
Need m-length buff to recv other's X
X holds garbage (partial results) on non-dest nodes
  Can use another buff to avoid
Can use other reduction operators (eg. min, max)
Can use async recv & extra buff to overlap comm & comp
  Complicates simple alg!

8. All-to-All Broadcast and All-to-All Reduction
In all-to-all bcast, all nodes have unique mesg to share wt everyone
In all-to-all reduction, each node has p mesgs, each of which should be reduced to a different processor (eg., mesg 0 for P0, mesg 1 for P1)
  Ex.: each node computes a portion of C in gemm, and then all-to-all reduce puts answer on proc owning particular blk of C
In all-to-all, everybody has something, so can do p comm per step (max, assuming 1 send at a time)
  In one-to-all, not all nodes had data, so max comm is p/2
  If we did p one-to-alls, would never use all p links in parallel
  Better to have special all-to-all than do p one-to-alls
We assume each node can send & recv at same time
Intro to Parallel Computing, Fig 4.8, pg 157
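The mask/abit logic of the general one-to-all broadcast can be traced without a message-passing library. The sketch below is a single-process simulation under the assumption that every node runs the same loop in lock-step; `simulate_oneall`, `MAXP` and the `senders` array are hypothetical names introduced for illustration.

```c
#include <assert.h>

#define MAXP 64   /* illustration limit: d <= 6 */

/* Simulate the one-to-all broadcast from src on p = 2^d nodes.
   senders[s] is set to the number of sends in step s.
   Returns how many nodes hold the message at the end. */
int simulate_oneall(int d, int src, int senders[])
{
    int has[MAXP] = {0};
    int p = 1 << d, step = 0, count = 0;

    assert(d >= 0 && p <= MAXP && src >= 0 && src < p);
    has[src] = 1;
    int mask = p - 1;
    for (int i = d - 1; i >= 0; i--, step++) {
        int abit = 1 << i;
        mask ^= abit;                       /* admit nodes that differ in bit i */
        senders[step] = 0;
        for (int iam = 0; iam < p; iam++) {
            int viam = iam ^ src;           /* virtual rank relative to src */
            if ((viam & mask) == 0 && (viam & abit) == 0) {
                int dest = (viam ^ abit) ^ src;
                assert(has[iam] && !has[dest]);  /* sender has data, receiver is new */
                has[dest] = 1;
                senders[step]++;
            }
        }
    }
    for (int iam = 0; iam < p; iam++)
        count += has[iam];
    return count;
}
```

Calling it with d = 3 and any source should report 1, 2 and 4 senders in the three steps and all 8 nodes reached, which is the recursive-doubling pattern described above.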


9. No Contention, Max Link All-To-All Ring Broadcast
Step 1: Send my mesg right, recv left
Step i: Send mesg recvd last time right, recv left
At step p-1, all nodes have p mesgs
    func allallbc_ring(buff)
    {
       left  = (p+Iam-1)%p;
       right = (Iam+1)%p;
       resbuf = buff;
       msg = buff;
       for (i=1; i < p; i++)
       {
          send msg to right
          recv msg from left
          resbuf = resbuf U msg;
       }
       return(resbuf);
    }
May hang if send runs out of buff
Usually, send/recv from areas in resbuf, do async recv, sync send
resbuf is $p \times m$ in size
Cost: $(p-1)(t_s + m t_w)$
Contention for linear array!
Intro to Parallel Computing, Fig 4.8, pg 157

10. No Contention, Max Link All-to-All Ring Reduction
All steps except 1st take data from preceding step, add it to buff of the node that message is destined for, and forward it on:
    void aared_ring(int m, TYPE **buffs)
    {
       rbuff = alloc(m);
       left  = (p+Iam-1)%p;
       right = (Iam+1)%p;
       for (i=1; i < p; i++)
       {
          dest = (Iam+i) % p;
          if (i != 1)
          {
             for (k=0; k < m; k++)
                buffs[dest][k] += rbuff[k];
          }
          arecv rbuff from right
          send buffs[dest] to left
       }
       for (k=0; k < m; k++)
          buffs[Iam][k] += rbuff[k];
    }
For dest != Iam, overwrites buffs wt partial results
Cost: $(p-1)(t_s + m(t_w + t_o))$
May be poss to reduce to: $(p-1)(t_s + m \max(t_w, t_o))$
Art class, Fig 0.0

11. Contention-filled All-to-All Mesh Broadcast
Perform algorithm in two steps:
  1. All-to-All ring bcast along rows [$(\sqrt{p}-1)(t_s + m t_w)$]
  2. AA ring bcast of combined result along cols [$(\sqrt{p}-1)(t_s + \sqrt{p}\, m t_w)$]
Cost: $(\sqrt{p}-1)(t_s + m t_w) + (\sqrt{p}-1)(t_s + \sqrt{p}\, m t_w) = 2(\sqrt{p}-1)t_s + (\sqrt{p}-1)(\sqrt{p}+1) m t_w = 2(\sqrt{p}-1)t_s + (p-1) m t_w$
For non-square grids, do step 1 on shortest dim
Since subgrids of grids wt wraparound links do not have wraparound, this is likely to be a best alg for small messages, where $t_s$ dominates
Contention unavoidable w/o wraparound, but if we treat mesh as 1-d ring, messages stay smaller, so contention hurts less, and we finish in normal time [$(p-1)(t_s + m t_w)$], which says if $t_s$ is small, contention savings may make this the better algorithm
  Might modify 1-d ring so that even rows go in increasing dir, odd rows go decreasing
Note that a sub-array of a ring topo is also not a ring, but rather a linear array, so actually none of these algs are contention free in practice unless whole machine is used

12. Contention free All-to-All Bcast on Hypercube
Done in $\log_2(p)$ steps, wt mesg size doubling at each step
    func allallbc_hyp(buff)
    {
       resbuff = buff;
       for (i=0; i < d; i++)
       {
          part = Iam ^ (1<<i);
          send resbuff to part
          recv mesg from part
          resbuff = resbuff U mesg
       }
    }
Cost: $\sum_{i=1}^{\log_2(p)} (t_s + 2^{i-1} m t_w) = \log_2(p)\, t_s + (p-1) m t_w$
sub-hypercube has all needed links
Will cause contention on all other interconnects
Does not reduce dominant $t_w$ term
Is generally contention-free, other topos not
Intro to Parallel Computing, Fig 4.11
[Figure: All-to-all bcast on 8-node hypercube]
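The ring all-to-all broadcast can be followed step by step in a single process. This is a minimal sketch assuming one int per node and lock-step execution; `simulate_allgather_ring`, `MAXP`, `cur` and `nxt` are hypothetical names, and the real algorithm would of course use sends and receives rather than array copies.

```c
#include <assert.h>
#include <string.h>

#define MAXP 16   /* illustration limit on ring size */

/* Simulate the ring all-to-all broadcast with one int per node.
   On return res[n][k] holds node k's value on node n.
   Returns the number of communication steps (p - 1). */
int simulate_allgather_ring(int p, const int val[], int res[][MAXP])
{
    int cur[MAXP], nxt[MAXP], steps = 0;

    assert(p >= 1 && p <= MAXP);
    for (int n = 0; n < p; n++) {
        cur[n] = n;                 /* origin of the message node n sends next */
        res[n][n] = val[n];
    }
    for (int i = 1; i < p; i++, steps++) {
        for (int n = 0; n < p; n++) {
            int left = (p + n - 1) % p;
            nxt[n] = cur[left];     /* recv from left what it sent right */
            res[n][nxt[n]] = val[nxt[n]];
        }
        memcpy(cur, nxt, p * sizeof cur[0]);
    }
    return steps;
}
```

Every message is a fixed m words and each node uses only its right-going link per step, which is why the cost stays at (p-1)(t_s + m t_w).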

13. All-to-All Summary
In general, issues such as blocking send/recv complicate bcast beyond those described here
Reduction ops have the added complexity of trying to overlap comm & comp, which complicates them further
  If successful, cost same as bcast ($t_o$ done during $t_w$)!
Unlike 1-to-all bcast, cannot use hypercube alg on all topos w/o contention
There are contention-free bcasts for ring, 2-D mesh wt wraparound, and hypercube interconnects, wt dominant term cost $(p-1) m t_w$, and startup costs $(p-1)t_s$, $2(\sqrt{p}-1)t_s$, and $\log_2(p)\, t_s$, respectively
Hypercube remains contention free when subdivided by power of two, but mesh and ring do not
  Need process/processor mapping to min non-hyp contention
Since all have same dominant term, ring is probably best large-case algorithm, since it will cause least contention due to fixed-size mesgs

14. Contention free All-Reduce on Hypercube
An all-reduce (AKA leave-on-all reduction) is semantically equivalent to performing an all-to-one reduction followed by a one-to-all bcast.
Done in $\log_2(p)$ steps, using bidirectional exchange (sim to AA):
    allred_BE(int d, int m, TYPE *buff)
    {
       wrk = alloc(m);
       for (i=0; i < d; i++)
       {
          part = Iam ^ (1<<i);
          ri = Arecv(part, m, wrk);
          send(part, m, buff);
          wait(ri);
          for (k=0; k < m; k++)
             buff[k] += wrk[k];
       }
    }
Cost: $\log_2(p)(t_s + m(t_w + t_o))$
No overlap of $t_o$, since must send most recent result
No contention on hypercube
  other topos prob use $2\log_2(p)$ a1, 1a
Redundant comp: danger on heterogeneous nodes
Always using p links
Any topo but hypercube has contention
[Figure: Schematic for 8-node Bidirectional Exchange, steps S = 0, 1, 2]

15. Prefix Sum (Scan) Operation
Wt $n_0, n_1, \ldots, n_{p-1}$ (1 per node)
Compute: $s_k = \sum_{i=0}^{k} n_i$ for all nodes
    func prefix_sum_hcube(my_num)
    {
       res = my_num;
       msg = res;
       d = log2(p);
       for (i=0; i < d; i++)
       {
          part = Iam ^ (1<<i);
          send msg to part
          recv hisnum from part
          msg += hisnum;
          if (part < Iam)
             res += hisnum;
       }
       return(res);
    }
m = 1 in book
For 1 num, forget contention
Cost: $< \log_2(p)(t_s + m(t_w + t_o))$
For non-hyp, can use any aa to build (eg., mesh-based aa [$< (p-1)(t_s + m(t_w + t_o))$])
Intro to Parallel Computing, Fig 4.13, pg 168
[Figure: prefix sum on 8-node hypercube]
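The role of the `part < Iam` test in the prefix sum is easiest to see by running all nodes side by side. The sketch below simulates the exchange in one process, assuming p = 2^d and one number per node; `simulate_prefix_sum`, `MAXP` and `newmsg` are hypothetical names for illustration.

```c
#include <assert.h>

#define MAXP 64   /* illustration limit: d <= 6 */

/* Simulate the hypercube prefix sum on p = 2^d nodes.
   On return res[k] = num[0] + ... + num[k]. */
void simulate_prefix_sum(int d, const long num[], long res[])
{
    long msg[MAXP], newmsg[MAXP];
    int p = 1 << d;

    assert(d >= 0 && p <= MAXP);
    for (int n = 0; n < p; n++)
        res[n] = msg[n] = num[n];
    for (int i = 0; i < d; i++) {
        for (int n = 0; n < p; n++) {
            int part = n ^ (1 << i);
            long hisnum = msg[part];   /* what the partner sent this step */
            newmsg[n] = msg[n] + hisnum;   /* running sum of my whole sub-cube */
            if (part < n)
                res[n] += hisnum;      /* only lower-ranked data enters my prefix */
        }
        for (int n = 0; n < p; n++)
            msg[n] = newmsg[n];
    }
}
```

The outgoing `msg` always carries the sum over the node's current sub-cube, while `res` only accumulates contributions from lower-numbered partners, so both additions are needed in each step.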

16. Scatter (One-to-All Personalized) and Gather
A node sending unique mesgs to all nodes is a scatter operation, and collapsing unique mesgs to one node is a gather.
Programmed like one-to-all except wt mesg size halving (no contention on any topo):
    proc scatter(d, m, TYPE *buff)
    {
       mask = (1<<d) - 1;
       n = m * (1 << (d-1));           /* n = m*p/2: half of the source's p*m words */
       for (i=d-1; i >= 0; i--)
       {
          abit = (1<<i);
          mask ^= abit;
          if ((Iam & mask) == 0)
          {
             if ((Iam & abit) == 0)
                send(Iam^abit, n, buff+n);   /* pass on upper half */
             else
                recv(Iam^abit, n, buff);
          }
          n >>= 1;
       }
    }
Cost: $\sum_{i=1}^{\log_2(p)} \left(t_s + \frac{m p}{2^i} t_w\right) = \log_2(p)\, t_s + m p\, t_w \sum_{i=1}^{\log_2(p)} \frac{1}{2^i}$
  note: $\sum_{i=1}^{n} \frac{1}{2^i} = \frac{2^n - 1}{2^n}$
  $= \log_2(p)\, t_s + m p\, t_w \frac{2^{\log_2(p)} - 1}{2^{\log_2(p)}} = \log_2(p)\, t_s + m p\, t_w \frac{p-1}{p} = \log_2(p)\, t_s + m(p-1) t_w$
Intro to Parallel Computing, Fig 4.14, pg 169
Intro to Parallel Computing, Fig 4.15

17. All-to-All Personalized Communication
In all-to-all personalized communication (AKA: total exchange) each node has p buffers, with p-1 of them destined for differing nodes.
Like doing p scatter (gather) operations at once
Useful for FFTs, matrix transpose, data base joins
Comm pattern of total exchange same as all-to-all on all interconnects
  Contents & length of message different than all-to-all
  Will have contention on sub-groups except for hypercube
Label individual buffs by (src,dest) pair:
Intro to Parallel Computing, Fig 4.16

18. Unidirectional Total Exchange on a Ring
Send all mesgs to right destined for other nodes
Recv mesg from left, take out my piece, forward on
Stop when messages are empty (p-1 steps)
Intro to Parallel Computing, Fig 4.18

19. Unidirectional All-to-All Personalized on a Ring
Send all messages to right destined for other nodes
Recv mesg from left, take out my piece, forward rest on
Stop when messages would be empty (p-1 steps)
Cost: $\sum_{i=1}^{p-1} (t_s + m(p-i) t_w) = (p-1)t_s + m t_w \sum_{i=1}^{p-1} i = (p-1)t_s + m t_w \frac{((p-1)+1)(p-1)}{2} = (p-1)\left(t_s + \frac{m p}{2} t_w\right)$
Failure to exploit shortest path doubles bandwidth needs!!
[Figure: (src,dest) buffer contents on each node at steps i = 1, 2, 3]
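The closed form for the unidirectional ring total exchange can be checked against the step-by-step sum. This is a small sketch of the arithmetic only; the function names are hypothetical and the parameter values in any call are arbitrary.

```c
#include <assert.h>

/* Step-by-step cost: in step i the message carries (p - i) pieces of m words. */
double ring_total_exchange_sum(int p, double ts, double tw, int m)
{
    double total = 0.0;
    assert(p >= 2);
    for (int i = 1; i < p; i++)
        total += ts + (double)m * (p - i) * tw;
    return total;
}

/* Closed form: (p-1)(t_s + (m*p/2) t_w). */
double ring_total_exchange_closed(int p, double ts, double tw, int m)
{
    return (p - 1) * (ts + 0.5 * m * p * tw);
}
```

For p = 8, t_s = 10, t_w = 1 and m = 4 both functions give 182: seven startups plus 4 * (7 + 6 + ... + 1) = 112 words of transfer time.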

20. Bidirectional All-to-All Personalized on a Ring
In ring, two links lead to the same node (left & right), so each node sends half his messages to the left, and the other half to the right.
Shorthand: $c = \lceil (p-1)/2 \rceil$, $f = \lfloor (p-1)/2 \rfloor$
  1. Send the $c$ messages bound to nearest nodes on right to right:
     [Figure: buffer contents for rightward steps i = 1, 2]
     Cost: $\sum_{i=1}^{c} (t_s + m(c - i + 1) t_w) = c\, t_s + m t_w \frac{c(c+1)}{2}$
  2. Send the $f$ messages bound to nearest nodes on left to left:
     [Figure: buffer contents for leftward steps j = 1, 2]
     Cost: $\sum_{i=1}^{f} (t_s + m(f - i + 1) t_w) = f\, t_s + m t_w \frac{f(f+1)}{2}$
Tot cost: $(p-1)t_s + \frac{m t_w}{2}\left(c(c+1) + f(f+1)\right)$

21. Detailed Bidirectional All-to-All Personalized Cost
Total cost $= (p-1)t_s + \frac{m t_w}{2}\left(c(c+1) + f(f+1)\right)$
If p odd, $c = f = \frac{p-1}{2}$; examine only $t_w$ term:
  $\frac{1}{2}\left(c(c+1) + f(f+1)\right) = \frac{p-1}{2}\left(\frac{p-1}{2} + 1\right) = \frac{(p-1)(p+1)}{4}$
  Tot odd cost $= (p-1)\left(t_s + m \frac{p+1}{4} t_w\right)$
If p even, $c = f + 1 = \frac{p}{2}$; examine $t_w$ term:
  $\frac{1}{2}\left(\frac{p}{2}\left(\frac{p}{2}+1\right) + \left(\frac{p}{2}-1\right)\frac{p}{2}\right) = \frac{p^2}{4}$
  Tot even cost $= (p-1)t_s + m \frac{p^2}{4} t_w$
Close enough: $(p-1)t_s + m \frac{p^2}{4} t_w \approx (p-1)\left(t_s + m \frac{p+1}{4} t_w\right)$
Uni or Bidir ring: $t_s = O(p)$, $t_w = O(p^2)$
p scatters [$p(\log_2(p)\, t_s + m(p-1) t_w)$]: $t_s = O(p \log_2(p))$, $t_w = O(p^2)$

22. Proof of All-to-All Personalized Optimality on Ring
Assume avg dist each m-length packet travels is: $\frac{\sum_{i=1}^{p-1} i}{p-1} = \frac{p}{2}$
Directly connected traffic = (# nodes)(bufflen)(dist traveled) $= (p)(m(p-1))\left(\frac{p}{2}\right)$
Since tot # links to share this load is p, optimal bw cost is: $t_w \frac{(p)(m(p-1))(p/2)}{p} = t_w\, m(p-1)\frac{p}{2}$
BW cost of unidirectional ring: $(p-1)\left(m \frac{p}{2}\right) t_w$, so optimal
Problem is assumption about avg dist: not true for best alg!
  Bidirec ring needs some buffer sorting, which can add to cost.
  True within factor of 2, but p scatters is off by a factor of 2 again!
  p scatters: $p(\log_2(p)\, t_s + m(p-1) t_w) = p\log_2(p)\, t_s + p(p-1) m t_w$
  AA pers has lower order $t_s$ term as well
NOTE: All aa ring comm is nearest neighbor, so store & forward works as well as cut-through routing!

23. All-to-All Personalized on Square 2-D Mesh
Steps:
  1. Assemble mesgs into c groups of r mesgs destined for each col ($r = c = \sqrt{p}$)
  2. Each row does aa pers on c mesgs of length mr
  3. Each node assembles mesgs into r groups of c mesgs
  4. Each col does aa pers on r mesgs of length cm
Cost of step 2 (use ring cost wt $m' = m\sqrt{p}$, $p' = \sqrt{p}$): $(\sqrt{p}-1)\left(t_s + t_w (m\sqrt{p})\frac{\sqrt{p}}{2}\right) = (\sqrt{p}-1)\left(t_s + t_w \frac{m p}{2}\right)$
Next phase same (r = c), tot cost: $(2 t_s + m p\, t_w)(\sqrt{p}-1)$
Ignores shuffle cost
Ignores no-wrap again!
Intro to Parallel Computing, Fig 4.19

24. Small-message Optimal All-to-All Personalized on Hypercube
Extend mesh alg to hyp; for each of the $\log_2(p)$ dims (two $(d-1)$-D sub-cubes each):
  1. Nodes exchange mesgs from/to other sub-cube
Cost: $\left(t_s + m \frac{p}{2} t_w\right)\log_2(p)$
Rearr cost: $m p \log_2(p)\, t_r$
traff = (# nodes)(blen)(dist)
avg dist trav $= \frac{\log_2(p)}{2}$
Traffic $= (p)(m(p-1))\left(\frac{\log_2(p)}{2}\right)$
# hyp links: $\frac{p\log_2(p)}{2}$
Best direct comm BW alg = (traff)/(# links) $= \frac{(p)(m(p-1))(\log_2(p)/2)}{p\log_2(p)/2} = m(p-1)$
Best alg $O(p)$, this alg $O(p\log_2(p))$ for BW
This alg $O(\log_2(p))$ on $t_s$, so optimal for small mesg
Intro to Parallel Computing, Fig 4.20

25. BW-Optimal All-to-All Personalized on Hypercube
Each pair of nodes does bidirectional exchange:
    proc aapers_hyp()
    {
       for (i=1; i < p; i++)
       {
          part = Iam ^ i;
          send M[Iam,part] to part
          recv M[part,Iam] from part
       }
    }
Overlaps on links are in different directions
  Known as e-cube routing
Cost: $(t_s + t_w m)(p-1)$
  BW $O(p)$ rather than $O(p\log_2(p))$
  No shuffling required!
  $t_s$ term $O(p)$ rather than $O(\log_2(p))$
This alg maximizes BW by sending min mesg size (m)
  Hurts $t_s$ since must perform p-1 steps
Intro to Parallel Computing, Fig 4.21

26. Musings on Contention in All-to-All
Only one-to-all alg is contention free on subpartition of procs
Problem is that one of the assumed links (wraparound) is missing, and all-to-all uses all links in parallel (a1 uses p/2 max)
  This means that one message must span p-1 active links
  Both bad, as algs work by having all nodes active at once
  Can assume will effectively halve BW ($2 t_w$ multiplier)
My guess is book ignores contention in sub-part because given alg still optimal in face of contention:
  Make aa bcast [$(p-1)(t_s + m t_w)$] contention-free by doing p 1-to-alls [$p\log_2(p)(t_s + m t_w)$]: double-m aa better or equal
  p 1a ring bcasts would pipeline for cost: $((p-1) + \ldots)(t_s + m t_w)$, but this requires wraparound too!
  If alg opt on p links, differs only by a factor of 2 for p-1 links, so still asymptotically optimal

27. Circular Shifts on Ring
In a circular q-shift, all nodes send their buff to node $(Iam+q) \bmod p$, and recv new data from $(p+Iam-q) \bmod p$, where $0 < q < p$.
Direct send best for small mesgs, but full of contention (cuts BW).
  Cost $\approx t_s + \min(q, p-q)\, m t_w$
To optimize BW, do $\min(q, p-q)$ 1-step shifts to left/right resp.
  If $q \le p/2$, shift to right q times
  If $q > p/2$, shift to left $p-q$ times (eg. q = 7 on the 8-node ring, shift left 1)
Cost: $\min(q, p-q)(t_s + m t_w)$
Will have contention on linear array (effectively double m)
[Figure: 2-shift on 8-node ring, showing node contents at i = 0, i = 1, and done]
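The direction choice for the bandwidth-optimized shift can be sketched as follows. This is a single-process illustration assuming one int per node; `circular_shift` and `MAXP` are hypothetical names, and each inner pass stands for one nearest-neighbor send/recv step.

```c
#include <assert.h>

#define MAXP 64   /* illustration limit on ring size */

/* Perform a circular q-shift on a ring of p values using 1-step shifts
   in the cheaper direction. Returns the number of 1-step shifts used,
   i.e. min(q, p - q). */
int circular_shift(int p, int q, int buf[])
{
    int tmp[MAXP];

    assert(p >= 1 && p <= MAXP && q >= 0 && q < p);
    int right = (q <= p - q);            /* q <= p/2: go right, else go left */
    int steps = right ? q : p - q;
    for (int s = 0; s < steps; s++) {
        for (int n = 0; n < p; n++) {
            /* shifting right means node n receives from its left neighbor */
            int from = right ? (p + n - 1) % p : (n + 1) % p;
            tmp[n] = buf[from];
        }
        for (int n = 0; n < p; n++)
            buf[n] = tmp[n];
    }
    return steps;
}
```

On an 8-node ring a 2-shift takes two rightward steps and leaves node 0 holding node 6's data, while a 7-shift is done as a single leftward step.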

28. Circular q-shift on a Square 2-D Mesh
Assume row-major grid:
  1. Do $q \bmod \sqrt{p}$ shift along rows
  2. $q \bmod \sqrt{p}$ cols should have shifted their val down one row, so do this along columns ($q \bmod \sqrt{p}$ cols participate)
  3. Do $\lfloor q/\sqrt{p} \rfloor$ shifts along col
In practice, shift left/right/up/down for min movement
  Max shift $\sqrt{p}/2$ per dim
  Max cost: $(t_s + m t_w)(\sqrt{p} + 1)$
Intro to Parallel Computing, Fig 4.22

29. Improving Bandwidth Performance by Splitting up Messages
Discussed AA algs that are asymptotically BW optimal, but not 1a, a1 & all-reduce:
  All-reduce is BW opt for hypercube only, else do a1 red, 1a bcast
Problem is that, unlike aa ops, 1a, etc, do not use all links from start (recursive doubling), so link usage is crippled
For large m-size, can split mesg into chunks, send them out to bring links in earlier in the process
  Additional $t_s$ cost to reduce BW cost
In practice implement $t_s$ optimized, $t_w$ optimized, poss hybrid, and switch amongst them depending on m
We can use asymptotically optimal aa algs to build asymptotically optimal algorithms for:
  1. One-to-all broadcast
  2. All-to-one reduction
  3. Non-hypercube all-reduce

30. Asymptotically Optimal One-to-All Broadcast
Scatter & aa bcast are asymptotically optimal, so build an asymptotically optimal 1a bcast from them by splitting buff into p buffs of length $n = m/p$ (costs for ring, works on any topo):
Reg 1a: $\log_2(p)(t_s + m t_w)$, so O(BW) $= \log_2(p)\,(m t_w)$
  1. Do a scatter on n-len buffs: $\log_2(p)\, t_s + \frac{m}{p}(p-1) t_w$
  2. Do a aa bcast on n-len buffs: $(p-1) t_s + \frac{m}{p}(p-1) t_w$
Total cost $= (\log_2(p) + p - 1) t_s + 2\frac{m}{p}(p-1) t_w$
  p-1 extra $t_s$
  O(BW) $= 2\frac{p-1}{p}(m t_w)$
  Asymptotically BW optimal, since it costs the same order as a ptp send!
Intro to Parallel Computing, Fig 4.14, pg 169: scatter [$\log_2(p)\, t_s + m(p-1) t_w$]
Intro to Parallel Computing, Fig 4.8, pg 157: all-to-all bcast [$(p-1)(t_s + m t_w)$]

31. Asymptotically Optimal All-to-One Reduction
AA reduction & gather are asymptotically optimal, so build an asymptotically optimal a1 reduction from them by splitting buff into p buffs of length $n = m/p$ (costs for ring, works on any topo):
Reg a1: $\log_2(p)(t_s + m t_w)$, so O(BW) $= \log_2(p)\,(m t_w)$
  1. Do a aa reduc on n-len buffs: $(p-1) t_s + \frac{m}{p}(p-1) t_w$
  2. Do a gather on n-len buffs: $\log_2(p)\, t_s + \frac{m}{p}(p-1) t_w$
Total cost $= (\log_2(p) + p - 1) t_s + 2\frac{m}{p}(p-1) t_w$
  p-1 extra $t_s$
  O(BW) $= 2\frac{p-1}{p}(m t_w)$
  Asymptotically BW optimal, since it costs the same order as a ptp send!
Intro to Parallel Computing, Fig 4.8: all-to-all reduction [$(p-1)(t_s + m t_w)$]
Intro to Parallel Computing, Fig 4.14, pg 169: gather [$\log_2(p)\, t_s + m(p-1) t_w$]
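The trade between extra startups and reduced bandwidth cost can be made concrete by evaluating both broadcast formulas. The sketch below is only the cost model from the slides (ring costs, p a power of two); `bcast_regular`, `bcast_split` and `ilog2` are hypothetical helper names, and the sample parameter values are arbitrary.

```c
#include <assert.h>

/* Integer log2 for p = 2^d (hypothetical helper). */
int ilog2(int p)
{
    int d = 0;
    assert(p > 0 && (p & (p - 1)) == 0);
    while ((1 << d) < p)
        d++;
    return d;
}

/* Regular recursive-doubling broadcast: log2(p)(t_s + m t_w). */
double bcast_regular(int p, double ts, double tw, double m)
{
    return ilog2(p) * (ts + m * tw);
}

/* Split broadcast: scatter of (m/p)-word pieces, then all-to-all bcast. */
double bcast_split(int p, double ts, double tw, double m)
{
    double n = m / p;
    double scatter   = ilog2(p) * ts + n * (p - 1) * tw;
    double allgather = (p - 1) * ts + n * (p - 1) * tw;
    return scatter + allgather;
}
```

With p = 8, t_s = 10 and t_w = 1, an 8-word message is cheaper with the regular algorithm (54 versus 114), while an 8000-word message is cheaper when split (14100 versus 24030), which is why an implementation would switch between the two depending on m.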

32. Asymptotically Optimal All-reduce
All-reduce can be built out of all-to-one reduction followed by a one-to-all broadcast:
  All-reduce: (m-len a1 reduction) + (m-len 1a bcast)
  All-reduce: ([m/p-len aa reduce] + [m/p-len gather]) + ([m/p-len scatter] + [m/p-len aa bcast])
  [gather] followed by [scatter] = no-op
  All-reduce: m/p-len aa reduce + m/p-len aa bcast
Total cost: $2\left((p-1) t_s + \frac{m}{p}(p-1) t_w\right)$
  $2(p-1)\, t_s$ rather than $\log_2(p)\, t_s$
  O(BW) $= 2\frac{p-1}{p}(m t_w)$
Normal all-reduce: $\log_2(p)(t_s + t_w m)$, so O(BW) $= \log_2(p)\,(m t_w)$
Intro to Parallel Computing, Fig 4.8

33. All-port Communication
We've been assuming 1 send/recv at a time, but some archs allow send/recv on all network ports simultaneously:
All-port communication provides speedup, but doesn't change the asymptotic cost much:
  Max speedup $O(\log_2(p))$ for hypercube
  Only constant speedup for mesh, ring, so no asymptotic diff
Very difficult to prog:
  Contention-free routes for all mesgs must be found
  Must make mesgs large enough to be split, w/o compute time dominating
  Even harder to keep efficient as logical nodes differ from physical
Mem must be able to keep up (best if MP has q local mems for all comm)

34. Communication Cost Summary
Operation                   | Ring time                                                                             | t_s O()         | BW O()
one-to-all/a1 bcast/reduc   | $\log_2(p)(t_s + m t_w)$, or $(\log_2(p) + p - 1) t_s + 2\frac{m(p-1)}{p} t_w$          | $O(\log_2(p))$  | $O(1)$
all-to-all bcast/reduc      | $(p-1)(t_s + m t_w)$                                                                  | $O(p)$          | $O(p)$
all-reduce                  | $\log_2(p)(t_s + m t_w)$, or $2\left((p-1) t_s + \frac{m(p-1)}{p} t_w\right)$           | $O(\log_2(p))$  | $O(1)$
scatter/gather              | $\log_2(p)\, t_s + m(p-1) t_w$                                                        | $O(\log_2(p))$  | $O(p)$
all-to-all pers             | $(p-1)\left(t_s + \frac{m p}{2} t_w\right)$                                           | $O(p)$          | $O(p^2)$
circ q-shift                | $\min(q, p-q)(t_s + m t_w)$                                                           | $O(p)$          | $O(p)$
Could ask above for hypercube or mesh
All-to-all pers BW: $O(p^2)$ for ring, $O(p\sqrt{p})$ for mesh, $O(p)$ for hypercube
Table in book (pg 187) on hypercube costs claims $(p-1) = O(1)$!

35. Operation to MPI Name Mapping
Operation                   | MPI Name
One-to-all broadcast        | MPI_Bcast
All-to-one reduction        | MPI_Reduce
All-to-all broadcast        | MPI_Allgather
All-to-all reduction        | MPI_Reduce_scatter
All-reduce                  | MPI_Allreduce
Gather                      | MPI_Gather
Scatter                     | MPI_Scatter
All-to-all personalized     | MPI_Alltoall
