Modelling and Compensating for Clock Skew Variability in FPGAs


Pete Sedcole, Justin S. Wong and Peter Y. K. Cheung
Department of Electrical & Electronic Engineering, Imperial College London, South Kensington campus, London SW7 2AZ, UK

Abstract

As integrated circuits are scaled down it becomes difficult to maintain uniformity in process parameters across each individual die. To avoid significant performance loss through pessimistic over-design, new design strategies are required that are cognisant of within-die performance variability. This paper examines the effect of process variability on the clock resources in FPGA devices. A model of variation in clock skew in FPGA clock networks is presented. Techniques for reducing the impact of variations on the performance of implemented designs are proposed and analysed, demonstrating that skew variation can be reduced by 70% or more through a combination of phase adjustment and clock re-routing. Measurements on a Virtex-5 FPGA validate the feasibility and benefits of the proposed compensation strategies.

1. Introduction

The fabrication of integrated circuits involves processes and materials that cannot be perfectly controlled. Manufacturing variations result in devices whose performance and power consumption vary, both between dice and, more recently, between circuit elements within a single die. This variability is expected to increase as transistor sizes are scaled down [1]. Field-Programmable Gate Arrays (FPGAs), often on the cutting edge of technology scaling, are susceptible to process and material variations, possibly more so than other high-performance integrated circuits. Unlike in ASICs, the critical paths of the circuit that an FPGA implements are not known until after fabrication, which results in particularly pessimistic circuit timing. Since variability cannot be eliminated by improving the fabrication process, new design techniques are required that are aware of the variability and manage it.
In our previous work, we reported on measurements of logic and routing variation in FPGAs using both ring oscillators [2] and an improved at-speed testing method [3]. We have also developed techniques for quantifying the variability in clock skew within FPGAs [4], which indicated that clock skew variability is comparable to logic path delay variability. With the knowledge gained from the experimental work in [4], this paper proposes a model to predict the effect of within-die parameter variations on FPGA clock networks. Because of the flexibility required in the clock routing within an FPGA, the structure of the clock network is substantively different from an ASIC clock tree, and is affected differently by variability. The model predicts the variation in the clock skew between any two register locations. An accurate model of the clock skew variation is beneficial, as it allows timing tools to reduce the guard-band required for the skew. Furthermore, we propose post-configuration compensation techniques to reduce the impact of clock skew variability, enabling more aggressive timing to be achieved. These are analysed using the clock skew variation model. The feasibility of the techniques is demonstrated by experimental measurements from a Xilinx Virtex-5 FPGA.

2. Background

2.1 Related work

The effect of process variability on clock trees has previously been examined in ASIC devices. This includes work employing Monte Carlo simulations [5], [6] as well as approaches based on canonical or numerical analysis of the classical H-tree clock structure [7], [8]. Unlike an FPGA clock network, which is fixed (although programmable), in ASICs the clock tree design and routing can be optimised for the application before fabrication. By including awareness of variability in the optimisation process, the impact of variation can be reduced. For example, Venkataraman, Sze and Hu have investigated skew scheduling and clock routing incorporating variability awareness [9].
Rajaram and Pan describe a technique for reducing skew variation by inserting cross-links in the clock tree [10]. Skew variation may be corrected post-fabrication by using active de-skewing techniques, commonly employing elements in the clock tree with adjustable delays [11], [12], [13]. This technique has recently been investigated for FPGAs [14], [15]. The only published work to date on FPGA clock variability is our previous report on the measurement of skew variability [4]. An in-depth analysis of the impact of variability on FPGA clock trees is so far lacking in the literature.

2.2 FPGA clock trees

The clock network in an integrated circuit is generally designed to manage the skew between any two points in the device. A design with zero nominal skew can be achieved by employing the well-known H-tree structure. An FPGA clock network must balance the minimal-skew requirement with sufficient flexibility to implement the clocking requirements of many different circuits. Inevitably, providing this flexibility

reduces the symmetry in the clock distribution, which has implications for the sensitivity of the clock to variations.

Fig. 1. The clock tree structure in a Virtex-4/5 type of device. The device is divided into a number of fixed-sized clock regions.

Fig. 2. The clock tree structure in a Stratix-II type of device. The structure is based on an H-tree, resulting in clock octants regardless of the size of the device.

Clock networks in FPGAs generally come in two flavours. A spine-and-branch approach is typified by the Xilinx Virtex-4 [16] and Virtex-5 devices [17], and is represented by the diagram in Fig. 1. The clock is distributed on a hierarchical network of linear spines, where each spine taps directly off the higher-level spine. In the Virtex-4 and -5 architectures, all clock regions are of equal size: larger devices have a higher number of separate clock regions.

The Stratix-II [18] and -III [19] devices from Altera favour a structure that resembles the traditional H-tree design, as shown in Fig. 2. Again, the structure is hierarchical: the higher levels of the hierarchy use an H-tree network, which minimises delay differences. At the lower levels, the clock is distributed to rows of logic blocks along linear branches. With this structure the device is divided into clock octants (or sixteen parts for the Stratix-III) regardless of the size of the device.

Although the clock networks in Altera devices are more balanced than those from Xilinx, FPGAs from both vendors exhibit definite differences in clock routing delay across the chip. Point-to-point clock skew (as reported by vendor timing tools) is typically of the order of hundreds of picoseconds in mid-range devices. In all cases, the clock network comprises duplicate resources to enable multiple clocks to be distributed throughout the device.
A Virtex-5 XC5VLX50 device, for example, has 32 central buffers each of which drives a separate vertical spine, and each region has 10 horizontal spine and branch lines [17]. Hierarchical levels are connected by some form of crossbar switch, and buffers at any level can in general be disabled to reduce dynamic power dissipation. In addition to the global clocking network, FPGAs also have regional clock buffers and distribution networks available. These are not considered in this paper, and will be the subject of future work.

Fig. 3. Example of clock routing to two spatially separated registers at locations u and v (source register at u, destination register at v). Resources on the route to u are labelled U_1 to U_12 and those on the route to v are labelled V_1 to V_13. The first seven labelled resources are shared in this example (U_i = V_i for i = 1..7).

3. Clock Network Variation

3.1 Model

In order to gain a better understanding of the effects of variability on the clock skew, an analysis of delay variations in the clock network is presented in this section. The outcome of the analysis is a model, which is then used in Section 4 to study strategies that compensate for clock skew variability. Consider two register locations in an FPGA, placed at positions u and v. As shown in Fig. 3, the clock is routed to each location along the dedicated clock resources, and the

resources may be shared for some part of the routing. In order to model the spatial correlation in delay variation, each wire segment is divided into unit lengths. The unit length is arbitrary, although the accuracy of the model improves with a smaller unit length. The deviation from nominal delay along each wire unit, and through each buffer, from the source to location $u$ is described by a variable $U_i$. Similarly, the variation in delays from the source to $v$ along the clock tree is modelled with variables $V_i$. Note that all variables $U_i$, $V_i$ have zero mean. The actual clock skew between locations $u$ and $v$ is

$$s(u,v) = s_0(u,v) + \sum_i w_i U_i - \sum_i w_i V_i \tag{1}$$

where $s_0(u,v)$ is the nominal clock skew. The summations exclude the variables corresponding to shared clock resources (such as $U_1$ to $U_7$ and $V_1$ to $V_7$ in the example of Fig. 3) as they do not contribute to the skew between $u$ and $v$. The variable $w_i$ is a weighting, equal to 1 for buffers and proportional to the wire segment length for wire units. The values of the weights are determined in the next section. The variance in skew is:

$$\mathrm{Var}[s(u,v)] = \mathrm{Var}\left[\sum_i w_i U_i - \sum_i w_i V_i\right] \tag{2}$$

This can be expanded:

$$\mathrm{Var}[s(u,v)] = \sum_i w_i^2\,\mathrm{Var}[U_i] + \sum_i \sum_{j \neq i} w_i w_j\,\mathrm{cov}[U_i, U_j] + \sum_i w_i^2\,\mathrm{Var}[V_i] + \sum_i \sum_{j \neq i} w_i w_j\,\mathrm{cov}[V_i, V_j] - 2\sum_{i,j} w_i w_j\,\mathrm{cov}[U_i, V_j] \tag{3}$$

The variance of the clock skew between two locations in the FPGA can therefore be calculated from the covariance matrix of the buffer and wire unit delays of the clock tree routing. It is necessary to determine the covariance between each buffer and wire delay. There are three cases to consider: the covariance between two buffers, the covariance between two wire units, and the covariance between a buffer and a wire.

Buffer-buffer: Where $U_i$ and $U_j$ correspond to buffer delay variation, we assume a homogeneous and isotropic spatial correlation function, $\rho_b(d)$, which depends only on the distance $d$ between the two buffers.
This assumption is common in the literature (e.g., [20]). The covariance is then simply:

$$\mathrm{cov}[U_i, U_j] = \sigma_{U_i}\,\sigma_{U_j}\,\rho_b(d) \tag{4}$$

Wire-wire: For the case where $U_i$ and $U_j$ both correspond to wire delays, a similar assumption is made for the correlation in delay variation. In this case the spatial correlation function is $\rho_w(d)$, where $d$ is the distance between the mid-points of the two wire units. Thus

$$\mathrm{cov}[U_i, U_j] = \sigma_{U_i}\,\sigma_{U_j}\,\rho_w(d) \tag{5}$$

Buffer-wire: The model assumes that there is no spatial correlation between buffer delay variation and wire delay variation. This is reasonable, since variation in buffer delay is the result of FEOL¹ processes, whereas wire delay variation is a consequence of BEOL² process variation. Therefore where $U_i$ and $U_j$ correspond to a buffer delay and a wire delay:

$$\mathrm{cov}[U_i, U_j] = 0 \tag{6}$$

It should be noted that similar equations can be used to express the covariance in the clock routing to $v$ and the covariance between clock trees ($\mathrm{cov}[V_i, V_j]$ and $\mathrm{cov}[U_i, V_j]$ respectively).

3.2 Weights

We now determine the values to assign to the weights $w_i$, based on the Elmore delay of a wire [21]. Recall that in the Elmore delay model, the total resistance and capacitance of a wire are divided into a finite number $N$ of distributed resistances and capacitances $R_i$ and $C_i$, $i = 1, \ldots, N$. For the case of interest, each $R_i$ and $C_i$ is a random variable and corresponds to one of the unit lengths described earlier. We define a time constant $X_i$ for each unit length of wire, $X_i = R_i C_i$, such that $U_i = X_i - E[X_i]$. Note that $dX_i/dU_i = 1$. The propagation delay of the wire is given by [21]:

$$t_w = \sum_{i=1}^{N} R_i \sum_{j=i}^{N} C_j \tag{7}$$

We are interested in the sensitivity of the overall propagation delay of the wire to a change in a variable $U_k$. Therefore we calculate the partial derivative:

$$\frac{\partial t_w}{\partial U_k} = \frac{\partial t_w}{\partial X_k}\,\frac{dX_k}{dU_k} \tag{8}$$

$$= \frac{\partial}{\partial X_k}\left( C_k \sum_{i=1}^{k-1} R_i + R_k C_k + R_k \sum_{j=k+1}^{N} C_j \right) \tag{9}$$

$$= \frac{1}{R_k} \sum_{i=1}^{k-1} R_i + 1 + \frac{1}{C_k} \sum_{j=k+1}^{N} C_j \tag{10}$$

This value is also a random statistic.
We can calculate the mean of this value by taking the expected value, noting that $E[R_i] = \bar{R}$ and $E[C_i] = \bar{C}$:

$$E\left[\frac{\partial t_w}{\partial U_k}\right] = E\left[\frac{1}{R_k}\right] \sum_{i=1}^{k-1} \bar{R} + 1 + E\left[\frac{1}{C_k}\right] \sum_{j=k+1}^{N} \bar{C} \approx N \tag{11}$$

We see that variation in a wire unit causes variation in the total delay in proportion to the number of wire units in that segment. In other words, the variation in delay of the wire increases superlinearly with length. This makes intuitive sense, since the wire delay also increases superlinearly with length.

¹ Front-End-Of-the-Line, the fabrication steps involving the patterning of silicon.
² Back-End-Of-the-Line, the fabrication steps for depositing metal layers.
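As a rough numerical check of Eqs. (7) and (10), the following Python sketch evaluates the Elmore delay of a wire split into N units and the sensitivity expression for a single unit. The function names and the uniform example values are illustrative assumptions and do not come from the original work.

```python
def elmore_delay(R, C):
    """Elmore delay of a wire of N units, Eq. (7): sum_i R_i * sum_{j>=i} C_j."""
    total = 0.0
    downstream_c = 0.0
    # Walk from the far end so the downstream capacitance accumulates once.
    for r, c in zip(reversed(R), reversed(C)):
        downstream_c += c
        total += r * downstream_c
    return total


def unit_sensitivity(R, C, k):
    """Eq. (10) for unit k (0-based index, unlike the 1-based index in the text):
    (1/R_k) * sum_{i<k} R_i + 1 + (1/C_k) * sum_{j>k} C_j."""
    return sum(R[:k]) / R[k] + 1.0 + sum(C[k + 1:]) / C[k]


# With identical units the sensitivity equals N for every position k,
# which is the approximation used in Eq. (11).
N = 8
R = [2.0] * N
C = [0.5] * N
sensitivities = [unit_sensitivity(R, C, k) for k in range(N)]
```

For identical units the nominal delay from `elmore_delay` grows as N(N+1)/2 times the unit time constant, which illustrates the superlinear growth with length noted above.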

The weight $w_i$ for a wire unit is set to the total number of units in the segment. This weighting only applies to wire units, so for buffers $w_i = 1$.

3.3 Case study

The model derived above can be used to calculate the expected variance of the clock skew between any two locations on the FPGA. Conventionally, variation in delay or skew is accounted for by guard-bands: margins added to the nominal delay or skew to allow for the worst-case variation. Typically a margin of three times the standard deviation is used for the guard-band. Thus, if the clock skew has a standard deviation $\sigma$ of 100ps, the guard-band will be ±300ps. Fig. 4 shows some 3σ guard-bands calculated using the proposed model for two different device types, corresponding to the two FPGA clock-tree styles discussed in Section 2.2.

Fig. 4. Clock skew variation modelling results. Two types of device are modelled, one with a spine-and-branch clock network, Virtex-5 style, and one with an H-tree clock network, Stratix-III style. The variance in clock skew relative to a fixed location (25, 5) is computed for both high and low spatial correlation. Using the variance values, the 3σ guard-banding values are plotted as a function of location. The z-scale of the plots is in units of the standard deviation in delay of one clock buffer. (a) Virtex-5 style device, high correlation in spatial variation. (b) Virtex-5 style device, low correlation in spatial variation. (c) Stratix-III style device, high correlation in spatial variation. (d) Stratix-III style device, low correlation in spatial variation.

For each of the two device types, two different levels of spatially-correlated variation are modelled. For the high correlation model, the spatial correlation functions $\rho_b(d)$ and $\rho_w(d)$ fall as $d^{-0.3}$ and asymptote to 0.2. For the low correlation model, $\rho_b(d)$ and $\rho_w(d)$ fall as $d^{-2}$ and asymptote to 0.1. The total level of variability is set to $\sigma_U = 10\%$ of delay for buffers and $\sigma_U = 5\%$ for wire units.
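The calculation behind such guard-band plots can be sketched in a few lines of Python. The sketch below evaluates Eq. (3) with the covariance rules of Eqs. (4) to (6) for two lists of non-shared clock resources and returns the 3σ guard-band. The `Resource` type, the helper names, and in particular the functional form used in `make_rho` (value 1 up to unit distance, then a power-law decay towards a floor) are assumptions made for illustration; the text only states the decay exponent and the asymptote.

```python
import math
from dataclasses import dataclass


@dataclass
class Resource:
    kind: str            # "buffer" or "wire" (hypothetical labels)
    x: float             # position, in logic-block units
    y: float
    sigma: float         # standard deviation of the delay deviation
    weight: float = 1.0  # w_i: 1 for buffers, units-in-segment for wire units


def make_rho(floor, exponent):
    """Assumed correlation shape: 1 for d <= 1, falling as d**-exponent to `floor`."""
    def rho(d):
        return floor + (1.0 - floor) * max(d, 1.0) ** (-exponent)
    return rho


def covariance(a, b, rho_b, rho_w):
    """Eqs. (4)-(6); rho(0) must be 1 so that cov(a, a) is the variance of a."""
    if a.kind != b.kind:
        return 0.0  # buffer-wire pairs are taken as uncorrelated, Eq. (6)
    d = math.hypot(a.x - b.x, a.y - b.y)
    rho = rho_b if a.kind == "buffer" else rho_w
    return a.sigma * b.sigma * rho(d)


def weighted_cov_sum(A, B, rho_b, rho_w):
    return sum(a.weight * b.weight * covariance(a, b, rho_b, rho_w)
               for a in A for b in B)


def skew_variance(U, V, rho_b, rho_w):
    """Eq. (3): U and V hold only the resources not shared by the two routes."""
    return (weighted_cov_sum(U, U, rho_b, rho_w)
            + weighted_cov_sum(V, V, rho_b, rho_w)
            - 2.0 * weighted_cov_sum(U, V, rho_b, rho_w))


def guard_band(U, V, rho_b, rho_w, n_sigma=3.0):
    # Clamp tiny negative values caused by floating-point cancellation.
    return n_sigma * math.sqrt(max(skew_variance(U, V, rho_b, rho_w), 0.0))
```

With fully correlated resources the two branch contributions cancel and the guard-band tends to zero, while with uncorrelated resources the variances simply add, which matches the qualitative behaviour described for the high and low correlation cases.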
The plots are calculated assuming the register at one end of a signal path has been placed at location (25, 5) in the FPGA. The required guard-band to add to the clock skew for the second register is location-dependent. As expected, if the two end-point registers are placed within the same clock region, the required guard-banding is lower than if they are placed further apart. Although there are differences between the spine-and-branch and the H-tree clock distribution schemes, the total level of variability in the clock skew remains broadly similar for devices of the same size. Note also that where the variation has low spatial correlation, the necessary guard-banding is less position-dependent (the plots are "flatter"), as would be expected, although it is still advantageous to place both registers within the same clock region.

The model can be used during place-and-route to provide more aggressive timing than would be possible by using a single global guard-band value for skew. The model calculations are computationally inexpensive and could be performed as necessary during place-and-route. Alternatively, to avoid extra time overhead during place-and-route, the guard-band values could be pre-computed for various register locations and approximations used during place-and-route.

4. Variation Compensation

In this section we propose methods to mitigate variability in clock skew. The effectiveness of the methods is studied

by modifying the model of Section 3.1, and by experiments on a Xilinx Virtex-5 XC5VLX50 FPGA.

TABLE I. MODEL PARAMETERS

  Model parameter               Value
  Logic block rows              80
  Logic block columns           40
  Buffer delay                  µ = 1.0, σ = 10%
  Wire unit delay               µ = 0.1, σ = 5%
  High spatial corr. function   ρ(d) = 0.3d
  Low spatial corr. function    ρ(d) = 0.1d

4.1 Clock phase adjustment

Modern high-end FPGAs include several very flexible clock generating resources, such as PLLs and Digital Clock Managers [17], [19]. In both Stratix and Virtex devices, in addition to being capable of synthesizing many clock frequencies, these clock generators are also able to produce phase-shifted clocks where the amount of phase-shifting can be changed at runtime. Using this capability, it is possible to generate an additional clock of the same frequency as the main clock but phase-adjusted to compensate for skew variations. The amount of phase adjustment can be tuned for each FPGA. Since this requires an additional DCM/PLL to generate the second clock, it is only possible if there are unused DCMs/PLLs in the FPGA.

Although this technique can compensate exactly for the skew variation between any two particular register locations, it clearly cannot achieve this for all paths, as this would require a DCM/PLL for every register on the FPGA. A practical approach is to compensate for the random skew variation between two clock regions, by supplying one of the regions with a phase-adjusted clock tuned to compensate for the average offset in skew between the two regions. We call this technique regional phase compensation.

A further improvement may be possible by constraining the placement of registers within each region. If registers are placed close together they are more likely to experience the same variation in clock skew. Therefore, by placing all source and sink registers of critical paths between the two regions close together, the phase adjustment can be more finely tuned to the local variation.
We term this local phase compensation.

It is necessary to modify the model of Section 3.1 to include these adaptations. This is straightforward. Examining Fig. 3, it can be seen that phase compensation will cause the variation in skew between the two divergent branches of the clock tree to be exactly cancelled up to some fixed point along each branch (for example, up to just after $U_9$ and $V_9$). When calculating the skew variance under phase compensation, it is sufficient to disregard the terms corresponding to the clock tree before the compensation points.

Note that there will be an increase in power consumption from using spare DCM/PLL resources. If there are no such spare resources, or the power overhead is unacceptable, gains may still be made by splitting the main clock and routing it through two central clock buffers. Stochastic differences in the buffer delays will produce a phase shift between the two resulting clocks. The phase shift will not be controllable, however, so the effectiveness of this approach is limited.

Fig. 5. Required skew guard-bands after compensating for skew variation with dual phase-adjusted clocks, based on a model of a Virtex-5 style FPGA with high correlation in spatial variation. Assumes a source register placed at location (25, 5). Guard-band values are again plotted relative to the standard deviation in delay of a clock buffer. (a) After regional skew correction by clock phase adjustment. (b) After local skew correction by clock phase adjustment.

The results of the modified clock skew model are shown in Fig. 5. Again, one register is fixed at location (25, 5). The guard-banding required when two registers are supplied by phase-adjusted clocks is plotted as a function of the placement location of the second register. Fig. 5(a) shows the case where the clock phases are adjusted to cancel regional variations in skew. Fig. 5(b) is an example of the more aggressive local phase compensation.
This assumes that all registers for critical paths between regions are placed within 3 × 3 logic blocks in each region. The graphs can be compared to the baseline case in Fig. 4(a). The regional phase compensation scheme reduces the guard-band by up to 42%, and the local phase compensation reduces the guard-band by up to 49%. Both schemes are most effective for registers placed a long way apart.

4.2 Clock resource re-routing

As mentioned in Section 2.2, the buffers and wires that are used for clock signal routing in FPGAs are duplicated at each level, to provide flexibility and to allow multiple clocks to be distributed. Stochastic variations in the buffers, wires and switches will cause each duplicate resource to exhibit

different delays. It is possible to exploit these differences, given a particular FPGA and one clock net of interest, by selecting the clock routing which gives the best clock skew. As an example, consider the Virtex-5 FPGA from Xilinx. In this device there are 10 nominally identical horizontal clock spines per region. At each register there is a multiplexer which determines which clock spine is connected to the clock input of the register. The clock signal can be routed on all 10 clock lines simultaneously, and the best signal selected at each register by reconfiguring the multiplexer. The best signal may be the one with the closest to nominal skew. Alternatively, a clock signal with a deviation in skew could be selected to compensate for reduced slack caused by path delay variations.

Nominal skew objective: By choosing the signal with the closest-to-nominal skew, the skew variance will be reduced and therefore the required guard-band will also be smaller. To include this in the model, we need to quantify the effect of selection on the skew variance. Firstly, note that the duplicated clock resources are physically close together, so they will exhibit the same correlated delay variation. The difference in skew of $N$ duplicated resources is therefore a stochastic quantity of zero mean, which we will denote by the random variable $X_i$, $i = 1, \ldots, N$. Assuming that $X_i$ is approximately normally distributed with variance $\sigma^2$, its distribution can be described by

$$P(-x < X_i < x) = \mathrm{erf}\left(\frac{x}{\sqrt{2}\,\sigma}\right) \tag{12}$$

where $\mathrm{erf}(x)$ is the error function. Let us label the value of $X_i$ which is closest to zero by $Y$. It is straightforward to show that the probability density of $Y$ is:

$$f_Y(x) = \frac{N}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{x^2}{2\sigma^2}\right) \left[1 - \mathrm{erf}\left(\frac{|x|}{\sqrt{2}\,\sigma}\right)\right]^{N-1} \tag{13}$$

The variance of $Y$ is defined as $\mathrm{Var}[Y] = \int_{-\infty}^{\infty} x^2 f_Y(x)\,dx$ which, while not possible to solve analytically, can be computed numerically. For $N = 10$, $\mathrm{Var}[Y]$ evaluates to a constant multiple of $\sigma^2$ that is far smaller than $\sigma^2$ itself.
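The numerical evaluation of $\mathrm{Var}[Y]$ can be sketched as follows, using only the Python standard library. The function names, the trapezoidal rule, and the integration limits are choices made for this illustration; they are not taken from the original work.

```python
import math


def closest_to_zero_pdf(x, n, sigma=1.0):
    """Density of Y, the value closest to zero among n i.i.d. N(0, sigma^2)
    samples, as in Eq. (13)."""
    phi = math.exp(-x * x / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)
    tail = 1.0 - math.erf(abs(x) / (math.sqrt(2.0) * sigma))  # P(|X_i| > |x|)
    return n * phi * tail ** (n - 1)


def closest_to_zero_variance(n, sigma=1.0, steps=20000, span=10.0):
    """Trapezoidal estimate of Var[Y]; E[Y] is zero because the density is even."""
    upper = span * sigma
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        x = i * h
        end_weight = 0.5 if i in (0, steps) else 1.0
        total += end_weight * x * x * closest_to_zero_pdf(x, n, sigma)
    # Integrate over x >= 0 only and double the result.
    return 2.0 * total * h


var_single = closest_to_zero_variance(1)   # no selection: the plain variance
var_best_of_10 = closest_to_zero_variance(10)
```

With `n = 1` the routine returns the unselected variance $\sigma^2$, which is a convenient sanity check; with `n = 10` the result is a few percent of $\sigma^2$, which is the scaling factor applied to the variance terms of the duplicated resources.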
This is applied to the model of (3) by scaling the variance terms corresponding to the duplicated resources. The covariance terms remain the same.

Positive skew objective: For a given register, instead of selecting the clock routing that gives the most nominal skew, one may choose to select the routing that gives the most positive skew. This will yield the most slack for paths that end at that register, although at the expense of slack for paths originating at the register. In this case, we select the maximum value of $X_i$, which is the order statistic $X_{(N)}$. The variance values in the model of (3) will be replaced by $\mathrm{Var}[X_{(N)}]$, and the guard-band will be reduced by $E[X_{(N)}]$. Order statistics have been extensively studied; mean and variance tables are readily available, such as in [22].

Fig. 6. Guard-bands after compensating for skew by clock phase adjustments and clock re-routing, for a high amount of spatially correlated delay. (a) Nominal skew objective. (b) Most positive skew objective.

Results from the modified models for the clock resource re-routing strategies are plotted in Fig. 6. Both models assume that regional differences in phase are compensated for by the clock phase adjustment described above, and then the best of 10 available regional clock trees is used to route the clock signal. The graphs should therefore be compared to Fig. 5(a). By choosing the resources which give the nearest to nominal skew, the guard-band can be reduced by an additional 10% to 40% over regional phase compensation alone. The benefit is greatest when the two registers are placed close together. The most positive skew objective yields an additional reduction in guard-band of 30% to 90% compared with regional phase compensation.

The results in Fig. 5(a) and Fig. 5(b) are based on a high level of spatial correlation. The model has also been used to investigate the situation where the delay variation is more stochastic. The results are broadly similar.
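For the positive skew objective, the mean and variance of the largest of N normal samples can be read from tables, or computed directly. The Python sketch below integrates the standard density of the maximum order statistic, $N\,\phi(z)\,\Phi(z)^{N-1}$, for standard normal samples; the function names and the integration settings are assumptions of this illustration. Results are in units of $\sigma$ (mean) and $\sigma^2$ (variance).

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def max_order_statistic_moments(n, steps=40000, span=10.0):
    """Mean and variance of the largest of n i.i.d. standard normal samples,
    by trapezoidal integration of n * pdf(z) * cdf(z)**(n - 1)."""
    h = 2.0 * span / steps
    first = 0.0
    second = 0.0
    for i in range(steps + 1):
        z = -span + i * h
        end_weight = 0.5 if i in (0, steps) else 1.0
        density = n * normal_pdf(z) * normal_cdf(z) ** (n - 1)
        first += end_weight * z * density
        second += end_weight * z * z * density
    mean = first * h
    variance = second * h - mean * mean
    return mean, variance


mean_10, var_10 = max_order_statistic_moments(10)
```

For a resource group with standard deviation $\sigma$, the guard-band shift would be `mean_10 * sigma` and the replacement variance `var_10 * sigma**2`, following the substitution described for the model of (3).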
For the low-correlation case, the guard-band result for clock resource re-routing with the nominal skew objective is shown in Fig. 7 as an example. Compared to the highly correlated variation case of Fig. 6(a), the method offers less of an improvement for closely-spaced registers, and overall the guard-band has less locational dependence, as would be expected.

4.3 Experimental results

In order to validate the feasibility of the proposed skew variability compensation techniques, experiments have been

performed on a Xilinx Virtex-5 XC5VLX50-1 FPGA. These are designed to determine whether it is possible to change the clock phase for a region to compensate for skew variation, and whether different parallel clock resources do actually exhibit different delays. A simplified diagram of the test circuitry used is shown in Fig. 8. Two clock regions in the FPGA were supplied with separate clocks of the same frequency, where the phase offset between the two clocks can be adjusted dynamically. The phase adjustment was achieved using the Virtex embedded Digital Clock Managers [17].

Fig. 7. Guard-band after compensating for skew by clock phase adjustments and clock re-routing. The model assumes low spatial correlation in delay variation and the clock re-routing targets nominal skew.

Fig. 8. A simplified diagram of the test circuitry used in the Virtex-5 experiment. Two clock regions ("top" and "bottom") are supplied with separate clocks. The phase offset between the clocks can be adjusted dynamically. A total of 32 paths connect the two regions, 16 in either direction. (Diagram labels: clock generation; phase adjust; 4 possible central buffer locations; 9 possible regional buffers; "up" and "down" test paths, ×16 each, with launch and capture registers.)

Thirty-two paths were placed and routed in the FPGA between the two clock regions, 16 in each direction. Paths originating in the lower of the two regions are termed "up" paths, the others "down" paths. The observable delay of each path could be measured accurately using the method reported in [3]. The observed delay of a path is in reality the sum of the path propagation delay and the clock skew between the start and end registers of the path. An additional 192 paths were placed and routed in other regions of the FPGA, and were used for calibrating the measurements for environmental changes.
The experiment involved measuring the observable delay of the 32 test paths for different clock phase offsets, and when different clock resources were used to route the clock of the top-most region. Since the paths under test are invariant, any change in observed delay is caused by changes in clock skew. The raw measured path delays for all 36 combinations of clock routing are plotted in the left half of Fig. 9. It can be seen that changing the resources on which the clock is routed causes a change in measured delay of up to ±50ps. This is significant when compared to the variation in LUT delay, which has a standard deviation of approximately 11ps in this device [4].

Fig. 9. Empirical measurements and post-calibration values of observed path delay for all 32 paths (16 "up" and 16 "down") under test. Each path is measured 36 times, each time with a different combination of central buffer location and regional clock routing.

The mean measured path delay is 3705ps. Note that there is a difference in the ensemble measurements of the "up" paths (3807ps) compared to the "down" paths (3603ps). There are also differences in delays between paths within the "up" group and within the "down" group. These differences are partially due to process variability and partially due to differences in the placement and routing of each path.

Since we are interested in compensating for clock skew variation, it is necessary to calibrate the initial data-set to produce a set of values where the effect of other sources of variation in the delay has been removed. The measurements were first calibrated to remove expected differences in delay using the path and skew timing reported by the vendor timing tools. The resulting values for the delay of each path were then shifted towards the mean to counteract the variance introduced by the LUT in each path. The resulting post-calibrated values, plotted in the right half of Fig. 9, are somewhat artificial but realistic.
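The reasoning that a change in observed delay for a fixed path equals a change in clock skew can be expressed as a short Python sketch. The data layout (one row per path, one column per clock routing combination), the function names, and the example numbers are hypothetical; the actual measurement and calibration procedure is the one described above and in [3].

```python
def skew_deviation(observed):
    """For each path, subtract its mean observed delay over all clock routings.
    The path itself does not change, so what remains is the relative change in
    clock skew caused by each routing choice."""
    deviations = []
    for row in observed:
        mean = sum(row) / len(row)
        deviations.append([delay - mean for delay in row])
    return deviations


def best_routing(observed):
    """Index of the routing whose slowest observed path is fastest."""
    n_routings = len(observed[0])
    worst_case = [max(row[r] for row in observed) for r in range(n_routings)]
    return min(range(n_routings), key=worst_case.__getitem__)


# Hypothetical observed delays in ps: 2 paths x 3 clock routing combinations.
example = [
    [3700.0, 3720.0, 3680.0],
    [3600.0, 3590.0, 3610.0],
]
deviation = skew_deviation(example)
choice = best_routing(example)
```

Here `best_routing` picks the routing that minimises the slowest observed path, which is one possible selection criterion for a set of paths sharing a clock; other objectives, such as the nominal or positive skew objectives of Section 4.2, would use a different key.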
The effects of the different experiments are summarised in Fig. 10. The graph shows the timing offset (degradation) of the slowest path for a given test, relative to the case of nil variation. Nil variation is estimated as the mean of all calibrated delays. In order to gain an insight into how the degree of connectivity between regions affects the results, three bars are shown for each experiment: the case where the regions are connected by just one path in each direction, as well as the cases of four paths and sixteen paths. To give a meaningful sense of scale to the results, standard deviations of LUT delay, $\sigma_L$, are also plotted on the graph. Using the initial assignment of clock resources and no phase

correction, the slowest path delay is degraded by over $10\sigma_L$ compared to the case of zero skew variation. This is mainly due to the difference between the "up" and "down" path delays. By trying four different locations of the main clock buffer, but changing nothing else, this can be reduced by approximately half in this particular instance. A much greater improvement is possible by actively adjusting the clock phase between the two clock regions to cancel the difference in the "up" and "down" delays. Using this technique, the timing degradation is reduced to about 1 to $4\sigma_L$. The effectiveness of this technique is to some extent limited by the granularity of the phase adjustment possible using the Virtex-5. With infinitely-adjustable phase, the improvement would be slightly better, as indicated by the "Phase (ideal)" results.

Fig. 10. The observed delay of the slowest path using different compensation techniques. The delay is plotted relative to the nil-variation baseline. Different numbers of paths are considered: 1, 4 and 16 paths in either direction. A scale of LUT delay standard deviations is also plotted for reference.

The best result from this series of experiments came from a combination of phase adjustment and clock re-routing. By judicious selection of resources on which to route the clock to the top region, the effect of skew variation could be completely cancelled for the cases of one or four paths. The experimental setup does not account for the negative impact that this approach may have on the slack of other circuit paths. Nevertheless, it demonstrates how effective the technique can be.

5. Conclusions

The clock distribution networks in FPGAs are substantially different from those in ASICs. The effect of process variability on clock skew, and approaches to mitigate such effects, must therefore also be different. This paper described a proposed clock skew variability model for FPGAs.
The model can be used to predict guard-band requirements on clock skew. In addition, two techniques for compensating for skew variability were presented. These involved adjusting the phase of the clock between regions, and using the stochastic differences in duplicated clock resources to achieve better skew timings. Results predicted by the model show that these techniques could significantly reduce the skew guard-band. Phase adjustments alone reduced the guard-band by almost 50%; by additionally routing the clock through the optimal resources the guard-band could be reduced by 70% or more. A reduced skew guard-band ultimately yields better timing. The feasibility of the techniques was also verified experimentally using a Virtex-5 FPGA.

Acknowledgements

The authors wish to acknowledge the financial support of the EPSRC under Platform Grant EP/C549481/1.

References

[1] S. R. Nassif, "Design for variability in DSM technologies," in Proc. IEEE International Symposium on Quality Electronic Design.
[2] P. Sedcole and P. Y. K. Cheung, "Within-die delay variability in 90nm FPGAs and beyond," in Proc. IEEE International Conference on Field Programmable Technology.
[3] J. S. Wong, P. Sedcole, and P. Y. K. Cheung, "Self-characterization of combinatorial circuit delays in FPGAs," in Proc. IEEE International Conference on Field Programmable Technology.
[4] P. Sedcole, J. S. Wong, and P. Y. K. Cheung, "Characterisation of FPGA clock variability," in Proc. International Symposium on Very Large Scale Integration.
[5] V. Mehrotra and D. Boning, "Technology scaling impact of variation on clock skew and interconnect delay," in International Interconnect Technology Conference.
[6] S. Zanella, A. Nardi, A. Neviani, M. Quarantelli, S. Saxena, and C. Guardiani, "Analysis of the impact of process variations on clock skew," IEEE Transactions on Semiconductor Manufacturing, vol. 13, no. 4, Nov.
[7] A. Agarwal, V. Zolotov, and D. T. Blaauw, "Statistical clock skew analysis considering intradie-process variations," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 23, no. 8, Aug.
[8] M. Hashimoto, T. Yamamoto, and H. Onodera, "Analysis of clock skew variation in H-tree structure," in Proc. IEEE International Symposium on Quality Electronic Design.
[9] G. Venkataraman, C. N. Sze, and J. Hu, "Skew scheduling and clock routing for improved tolerance to process variations," in Proc. Asia and South Pacific Design Automation Conference.
[10] A. Rajaram and D. Z. Pan, "Fast incremental link insertion in clock networks for skew variability reduction," in Proc. IEEE International Symposium on Quality Electronic Design.
[11] A. Chakraborty, K. Duraisami, A. Sathanur, P. Sithambaram, A. Macii, E. Macii, M. Poncino, and L. Benini, "Dynamic thermal clock skew compensation using tunable delay buffers," in Proc. International Symposium on Low Power Electronics and Design.
[12] A. Kapoor, N. Jayakumar, and S. P. Khatri, "A novel clock distribution and dynamic de-skewing methodology," in Proc. International Conference on Computer Aided Design.
[13] J.-L. Tsai, L. Zhang, and C. C.-P. Chen, "Statistical timing analysis driven post-silicon-tunable clock-tree synthesis," in Proc. International Conference on Computer Aided Design.
[14] S. Sivaswamy and K. Bazargan, "Statistical generic and chip-specific skew assignment for improving timing yield of FPGAs," in Proc. Field-Programmable Logic and Applications.
[15] S. Sivaswamy and K. Bazargan, "Statistical analysis and process variation-aware routing and skew assignment for FPGAs," ACM Transactions on Reconfigurable Technology and Systems, vol. 1, no. 1, Mar.
[16] Virtex-4 User Guide, Xilinx Inc., February.
[17] Virtex-5 User Guide v3.0, Xilinx Inc., February.
[18] Stratix II Device Handbook, Altera Corp., May.
[19] Stratix III Device Handbook, Altera Corp., May.
[20] J. Xiong, V. Zolotov, and L. He, "Robust extraction of spatial correlation," in Proc. International Symposium on Physical Design.
[21] W. C. Elmore, "The transient response of damped linear networks with particular regard to wideband amplifiers," Journal of Applied Physics, vol. 19, no. 1, Jan.
[22] H. J. Godwin, "Some low moments of order statistics," The Annals of Mathematical Statistics, vol. 20, no. 2, Jun. 1949.
