ECE 669 Parallel Computer Architecture

1 ECE 669 Parallel Computer Architecture Lecture Interconnection Network Performance

2 Performance Analysis of Interconnection Networks
- Bandwidth
- Latency: proportional to diameter
- Latency with contention
- Processor utilization
+ Physical constraints

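3 Bandwidth & Latency
Direct networks
Latency
- Worst case: ~ $nk$ hops
- Average latency: $n k_d$ hops, where $k_d$ is the average distance traveled in one dimension
Average distance by channel type
- torus, 2-dir channels: ~ $nk/4$
- torus, 1-dir channels: $n(k-1)/2$ ~ $nk/2$
- no end conns, 2-dir channels: ~ $nk/3$
Bandwidth per node (more complex)
- If all messages go to near neighbors: inversely proportional to the message size $B$
- If messages travel the average distance, then?
- Let Avg DIST = $nk/3$, Msg size = $B$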
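The average-distance approximations on this slide can be checked by brute force. The sketch below is illustrative only: the function names and the `kind` labels are made up here, and it assumes uniformly random source and destination nodes (a node may pick itself as the destination).

```python
from itertools import product


def ring_distance(a, b, k, kind):
    """Hops from position a to position b along one dimension of radix k."""
    if kind == "torus_1dir":   # unidirectional ring with end-around link
        return (b - a) % k
    if kind == "torus_2dir":   # bidirectional ring: take the shorter way round
        d = (b - a) % k
        return min(d, k - d)
    if kind == "mesh_2dir":    # bidirectional, no end-around connections
        return abs(b - a)
    raise ValueError(f"unknown channel type: {kind}")


def average_distance(k, n, kind):
    """Average hop count in a k-ary n-cube over all (src, dst) pairs.

    Dimensions are independent, so the total is n times the
    per-dimension average.
    """
    total = sum(ring_distance(a, b, k, kind)
                for a, b in product(range(k), repeat=2))
    return n * total / k**2


if __name__ == "__main__":
    k, n = 8, 2
    for kind, approx in [("torus_2dir", n * k / 4),
                         ("torus_1dir", n * k / 2),
                         ("mesh_2dir", n * k / 3)]:
        print(kind, average_distance(k, n, kind), "approx:", approx)
```

The exact values approach the slide's approximations as k grows; for small k the unidirectional torus and the mesh fall somewhat below $nk/2$ and $nk/3$.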

4 Analogy
If each student takes 8 years to graduate, and if a Prof. can support 10 students max at any time, how many new students can the Prof. take on in a year?
With $x$ new students per year, $8x = 10$, so each year take on $10/8$.
Similarly:
- Network has $Nn$ channels
- # of flits it can sustain = $Nn$
- # of msgs it can concurrently sustain = $Nn/B$
- Each msg flit uses $nk/3$ channels to dest
So, with each of the $N$ nodes injecting $x$ msgs per cycle,
$N x \cdot \frac{nk}{3} = \frac{Nn}{B}$
or BW per node = $x = \frac{3}{kB}$

5 Another way of getting BW
Max # msgs in net at any time = $Nn/B$
These take $kn/3$ cycles to get delivered, during which time no new msgs can get in.
I.e., we can inject $Nn/B$ msgs every $kn/3$ cycles,
or # injected per node per cycle = $\frac{Nn}{B} \cdot \frac{3}{kn} \cdot \frac{1}{N} = \frac{3}{Bk}$
Note: We have not considered contention thus far. In practice, latency shoots up well before we reach the theoretical bandwidth, because of contention.
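The bandwidth argument is an occupancy calculation: the number of messages the network can hold, divided by how long each one stays, gives the sustainable injection rate. A minimal sketch, assuming no contention (as the slide does) and a default average distance of $nk/3$; the function name and parameter names are chosen here for illustration.

```python
def bandwidth_per_node(N, n, k, B, avg_dist=None):
    """Upper bound on messages injected per node per cycle.

    N: number of nodes, n: dimensions (channels per node),
    k: radix, B: message size in flits.
    avg_dist defaults to n*k/3, the slide's average distance.
    """
    if avg_dist is None:
        avg_dist = n * k / 3
    max_msgs_in_flight = N * n / B            # N*n channels, B flits per message
    msgs_delivered_per_cycle = max_msgs_in_flight / avg_dist
    return msgs_delivered_per_cycle / N       # share of each node


if __name__ == "__main__":
    # Should agree with 3 / (k * B) when avg_dist is left at its default.
    print(bandwidth_per_node(N=64, n=2, k=8, B=4), 3 / (8 * 4))
```

Note that N cancels out: the per-node bound depends only on k and B once the average distance is proportional to n.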

6 What's a good performance metric to analyze networks?
Possibilities:
- Bandwidth
- Latency
Discuss:
- Bandwidth: good when latency is not an issue, i.e., if we can overlap all communication with computation. An estimate of how much info we can transport.
- Latency: good when not much parallelism exists (critical sections, e.g.), or when communication cannot be overlapped with computation.
- Effective processor utilization, $U = \frac{1}{1 + \text{req rate} \cdot L(U)}$: the best metric, but not always easy to derive.
[Figure: latency L versus bandwidth BW]

7 Performance Analysis
First, analysis based on latency alone
- Taking contention into account, but ignoring technological constraints
- Such analysis is o.k., as long as we do not compare workloads with different traffic rates
With no contention,
Latency $T = \text{hops} + M + B$
where hops = network hops, $M$ = mem delay, $B$ = message size
With contention,
Latency $T = (1 + c)\,\text{hops} + M + B$
where $c$ = average contention delay, in cycles, at each switch

8 Direct k-ary n-cube (Agarwal paper)
$c = \frac{\rho B}{1 - \rho} \cdot \frac{k_d - 1}{k_d^2} \cdot \left(1 + \frac{1}{n}\right)$
hops = $n k_d$
$\rho$ = fraction of bandwidth consumed = $m B k_d$
$k_d$ = distance traveled in a dim. in a unidirectional torus = $\frac{k - 1}{2}$
$m$ = traffic rate [probability of request]
k-ary n-cube latency:
$\therefore T = \left[1 + \frac{\rho B}{1 - \rho} \cdot \frac{k_d - 1}{k_d^2} \cdot \left(1 + \frac{1}{n}\right)\right] n k_d + M + B$
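To see how quickly contention inflates latency, the formulas on this slide can be evaluated directly. This is a sketch under the slide's definitions, taking $\rho = m B k_d$ and $k_d = (k-1)/2$; the function names and the choice to raise an error at saturation are assumptions made here.

```python
def contention_delay(m, B, k, n):
    """Average contention delay c per switch, in cycles."""
    k_d = (k - 1) / 2                  # avg distance per dimension, 1-dir torus
    rho = m * B * k_d                  # fraction of channel bandwidth consumed
    if not 0 <= rho < 1:
        raise ValueError(f"channel utilization {rho:.3f} is outside [0, 1)")
    return (rho * B / (1 - rho)) * ((k_d - 1) / k_d**2) * (1 + 1 / n)


def latency(m, B, k, n, M=0):
    """T = (1 + c) * hops + M + B, with hops = n * k_d."""
    k_d = (k - 1) / 2
    c = contention_delay(m, B, k, n)
    return (1 + c) * n * k_d + M + B


if __name__ == "__main__":
    for m in (0.0, 0.01, 0.03, 0.05, 0.06):
        print(m, latency(m, B=4, k=8, n=2, M=10))
```

With m = 0 the result reduces to the no-contention latency, hops + M + B; as $\rho$ approaches 1 the $1/(1-\rho)$ term dominates.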

9 A better metric
Processor utilization $U$
Or, how network delay impacts overall performance
$U$ = fraction of time spent doing useful work
[Figure: processor P and memory M connected through the network; a remote request takes T cycles]
Ideally, each instruction takes 1 cycle.
If the probability of a message (remote mem req) on each useful cycle is $m$, and the corresponding latency is $T$, each instruction takes $1 + mT$ cycles,
or $U = \frac{1}{1 + mT}$

10 Deriving U is not easy
Notice
- $m$ = probability of a message on a useful processor cycle
- $m_{\text{eff}} = mU$ = probability of msg on any cycle
- $T = f(m_{\text{eff}})$ = network delay as a function of $m_{\text{eff}}$
- $U = \frac{1}{1 + mT}$, or # of useful processor cycles depends on $T$ and $m_{\text{eff}}$
Cyclic dependence!
[Figures: T versus m and U versus m, marking T_0, U_0, m_0, and BW]
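One way to break the cyclic dependence numerically is to search for the U at which $U = 1/(1 + m\,f(mU))$ holds. The sketch below uses bisection, which works because the right-hand side can only fall as U rises. The latency model `latency_model` and its default parameters are assumptions for illustration (the k-ary n-cube contention formula with $\rho = m_{\text{eff}} B k_d$); any non-decreasing function of $m_{\text{eff}}$ could be passed instead.

```python
import math


def latency_model(m_eff, B=4, k=8, n=2, M=10):
    """Assumed network latency T = f(m_eff); infinite at or past saturation."""
    k_d = (k - 1) / 2
    rho = m_eff * B * k_d
    if rho >= 1:
        return math.inf
    c = (rho * B / (1 - rho)) * ((k_d - 1) / k_d**2) * (1 + 1 / n)
    return (1 + c) * n * k_d + M + B


def solve_utilization(m, latency_fn=latency_model, iters=60):
    """Find U with U = 1 / (1 + m * latency_fn(m * U)) by bisection."""
    if m <= 0:
        return 1.0, latency_fn(0.0)
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        # If mid is below the utilization it would itself produce,
        # the fixed point lies above mid.
        if mid < 1 / (1 + m * latency_fn(m * mid)):
            lo = mid
        else:
            hi = mid
    U = lo                      # lo always has finite latency
    return U, latency_fn(m * U)


if __name__ == "__main__":
    for m in (0.005, 0.02, 0.05, 0.2):
        U, T = solve_utilization(m)
        print(f"m={m}: U={U:.4f}, T={T:.2f}, m_eff={m * U:.4f}")
```

Plain iteration of U and T can oscillate near saturation, which is why bisection is used here.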

11 Processor utilization: k-ary n-cube
$m$ = probability of msg on useful cycle
Channel utilization: $\rho = m U B k_d$
Latency: $T = \left[1 + \frac{\rho B}{1 - \rho} \cdot \frac{k_d - 1}{k_d^2} \cdot \left(1 + \frac{1}{n}\right)\right] n k_d + M + B$   [1]
Processor utilization: $U = \frac{1}{1 + mT}$   [2]
Two equations, [1] and [2]. Two unknowns: $T$, $U$.
Solving,
$T = \frac{1}{2}\left(T_0 + B k_d - \frac{1}{m}\right) + \sqrt{\frac{1}{4}\left(T_0 - B k_d + \frac{1}{m}\right)^2 + B^2 (k_d - 1)(n + 1)}$
and $U = \frac{1}{1 + mT}$,
where $T_0 = n k_d + M + B$ = unloaded network latency
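Substituting [2] into the channel utilization and then into [1] leaves a quadratic in $T - T_0$, whose positive root gives T in closed form. The sketch below evaluates that root and checks it against both equations. The closed form is derived here from the two equations with $\rho = m U B k_d$ and $k_d = (k-1)/2$, not copied from the slide, and the function names are illustrative.

```python
import math


def solve_closed_form(m, B, k, n, M):
    """Return (T, U) from the positive root of the quadratic in T - T0."""
    k_d = (k - 1) / 2
    T0 = n * k_d + M + B                      # unloaded network latency
    a = T0 - B * k_d + 1 / m
    T = 0.5 * (T0 + B * k_d - 1 / m) + math.sqrt(
        a * a / 4 + B**2 * (k_d - 1) * (n + 1))
    U = 1 / (1 + m * T)                       # equation [2]
    return T, U


def latency_residual(m, B, k, n, M, T, U):
    """Difference between T and the right-hand side of equation [1]."""
    k_d = (k - 1) / 2
    rho = m * U * B * k_d
    c = (rho * B / (1 - rho)) * ((k_d - 1) / k_d**2) * (1 + 1 / n)
    return T - ((1 + c) * n * k_d + M + B)


if __name__ == "__main__":
    args = dict(m=0.02, B=4, k=8, n=2, M=10)
    T, U = solve_closed_form(**args)
    print(T, U, latency_residual(**args, T=T, U=U))
```

A residual near zero confirms that the pair (T, U) satisfies [1] and [2] simultaneously under these assumptions.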

12 Technology Constraints
How do technological constraints impact network design and performance?
Constraints
- Wire lengths limit the maximum speed of the clock
- # pins limits the size of nodes
- Bisection width limits # of wires per channel
Software
- Can we compile to the architecture?
- Can users specify programs?
- Is it scalable?

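13 The performance equation becomes
Latency
Recall, $T = (1 + c)\,\text{hops} + B$ (in cycles)
Now, $T = \left[T_{\text{switch}} + T_{\text{wire}}(n)\right] \cdot \left[(1 + c)\,\text{hops} + \frac{B}{W(n)}\right]$
- $T_{\text{switch}} + T_{\text{wire}}(n)$ = cycle time: switch delay plus wire delay, where the wire delay is a function of $n$
- $\frac{B}{W(n)}$ = msg length, where the channel width $W(n)$ is a function of bisection width
See paper for details, including example expressions for the wire delay and the channel width in terms of $k$, $N$, and $n$.
Summary: Constraints favor small n.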
