392 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 32, NO. 3, MARCH 2013


An Optimal Allocation Algorithm of Adjustable Delay Buffers and Practical Extensions for Clock Skew Optimization in Multiple Power Mode Designs

Kyoung-Hwan Lim, Deokjin Joo, Student Member, IEEE, and Taewhan Kim, Senior Member, IEEE

Abstract: Satisfying a clock skew constraint is one of the most important tasks in clock tree synthesis. The task becomes much harder when the clock tree is designed in a multiple power mode environment, in which the voltage applied to some design modules varies as the power mode changes. Recently, it has been shown that an adjustable delay buffer (ADB), whose delay can be tuned dynamically, can be used to solve the clock skew problem effectively under multiple power modes. However, because of the area and control overhead of ADBs, it is very important to minimize the number of ADBs to be allocated. This paper provides a complete solution to the problem of clock skew optimization using ADBs under multiple power modes. We propose a linear-time algorithm that simultaneously solves the problems of computing: 1) the minimum (optimal) number of ADBs to be used; 2) the location where each ADB is to be placed; and 3) the delay value of each ADB to be assigned to each power mode. Experimental results show that, in comparison with the previous work, which iteratively performs the ADB allocation, placement, and value assignment, our integrated algorithm produces consistently better designs for all tested benchmarks; it reduces the number of ADBs by 9.27% on average under the skew bound of ps, with even shorter clock latencies than those of the previous algorithm for ADB allocation, placement, and delay assignment.
To make it practically feasible, we also propose a new ADB design technique and systematic algorithmic solutions to address the problems of discrete delay values, slew rate variation, nonzero initial ADB delay, and a possible exploration of ADB resizing.

Index Terms: Adjustable delay buffer (ADB), cell allocation, clock skew, clock tree synthesis, multiple power modes.

Manuscript received March 11, 2012; revised June 3, 2012 and July 24, 2012; accepted August 20, Date of current version February 14, This work was supported in part by the Basic Science Research Program through the National Research Foundation under Grant , by the Center for Integrated Smart Sensors funded by the Ministry of Education, Science, and Technology as the Global Frontier Project (CISS ), and by the Ministry of Knowledge Economy, Korea, under the Information Technology Research Center Support Program supervised by the National IT Industry Promotion Agency (NIPA) under Grant NIPA-2012-H A preliminary version of this paper was presented in [1]. This paper was recommended by Associate Editor C. C.-N. Chu.
K.-H. Lim is with Samsung Electronics Company, Ltd., Yongin, Korea (kh12.lim@samsung.com).
D. Joo and T. Kim are with the School of Electrical Engineering and Computer Science, Seoul National University, Seoul, Korea (jdj@ssl.snu.ac.kr; tkim@ssl.snu.ac.kr).
Color versions of one or more of the figures in this paper are available online at Digital Object Identifier /TCAD

I. Introduction

In synchronous circuit design, all sequential elements in the design are synchronized by a unified signal, usually called a clock signal. Ideally, the clock signal should reach all sequential elements at the same time from the clock source. In practice, however, there is some timing difference between the clock signal paths from the clock source to the sequential elements because of variations in the path lengths and in the buffer characteristics of the paths.
The largest difference among the arrival times of a clock signal is called clock skew, and achieving zero clock skew is a very difficult task. One possible solution is to limit the clock skew to a certain bound that can tolerate all variations caused by the clock skew. Extensive research on clock tree optimization, such as clock routing, clock buffer insertion or sizing, and wire sizing, has been performed to minimize clock skew (e.g., [2]–[8]). A common assumption of those works is that the generated clock tree is to be operated under a single (fixed) power mode condition. For multiple power or voltage mode designs, the clock signal delay on a path may change as the applied power mode (i.e., operating condition) changes. Thus, a clock tree optimized to meet a clock skew constraint on one power mode may violate a clock skew constraint on another power mode. Even though the previous works can consider the clock skew constraint on every power mode, it is highly likely that the resulting clock tree uses a substantially long wirelength or that there exists no clock tree that satisfies the clock skew constraint on every power mode.

On the other hand, post-silicon tuning (e.g., [9]–[12]), such as inserting adjustable delay buffers (ADBs), is a widely used method for dealing with the timing problem caused by process and environment variations. Because the delay of an ADB can be controlled by its delay control inputs [13], the clock skew variation caused by process variation can be tuned by properly inserting ADBs after the manufacturing stage has been completed. The works that have used ADBs in clock tree synthesis aim to minimize the statistical variation of clock skew under a single power mode domain, and as yet the problem of using ADBs for clock skew optimization in multiple power modes has not been intensively investigated.
The idea of using ADBs in multiple power modes is to replace some of the normal clock buffers with ADBs so that the

clock skew constraint on each power mode can be met. When the power mode changes during execution, for example, from power mode Mode-1 to power mode Mode-2, the delays of ADBs in the clock tree that have been adjusted under Mode-1 are readjusted to meet the clock skew constraint under Mode-2. Since the ADB logic component is much bigger than the normal buffer and requires control lines, the ADB-based clock skew optimization in multiple power modes involves a set of related problems: 1) (problem-1) allocating a minimum number of ADBs; 2) (problem-2) finding the normal buffers (or locations) in the clock tree that are to be replaced by ADBs; and 3) (problem-3) determining the delay values of ADBs to be assigned on each power mode.

To our knowledge, the works in [14]–[16] are the only ones that have considered the use of ADBs to minimize clock skew in multiple power mode domains. The authors of [14] and [15] proposed a linear-time optimal algorithm for solving problem-3 and attempted to solve problems-1 and -2 heuristically in a greedy manner by repeatedly applying the algorithm of problem-3. The work in [16] proposed an efficient algorithm adopting a two-stage approach: performing a top-down ADB insertion followed by a bottom-up ADB elimination. Even though the two-stage approach reduces the runtime over the work in [14] and [15], it still does not guarantee optimality. Moreover, [14]–[16] do not address the practically important issues caused by the ADB allocation. Those are: 1) (problem-4) the consideration of the ADB's base delay, and 2) (problem-5) the consideration of the ADB's output slope change. (As will be discussed later, note that the base delay of an ADB is closely related to the size of the ADB implementation, whereas the output slope of an ADB dynamically varies according to the delay value of the ADB.)
In summary, the key contribution of our work consists of two parts: 1) the design of a linear-time optimal algorithm to solve the fundamental problems 1, 2, and 3 simultaneously, and 2) to address the practical problems 4 and 5, the development of systematic and effective postprocessing algorithms based on the optimal solution to problems 1, 2, and 3. This paper is an extended version of the preliminary work in [1]. The extension includes: 1) providing a formal proof of the optimality of the proposed linear-time ADB allocation algorithm that minimizes the number of ADBs to be used; 2) proposing a technique that considers the discrete delay increment of ADBs; 3) proposing a new ADB design technique to solve the output slew variation problem; 4) proposing a technique to support the nonzero base delay of ADBs; 5) proposing the ADB refinement technique to reduce the total cost of capacitors in the ADBs; and 6) providing a set of diverse experimental data to validate the effectiveness of the proposed techniques of practical extensions.

Fig. 1. Two logic structures of ADB. (a) Capacitor-based structure [17]. (b) Inverter-based structure [16], [18].

The remainder of this paper is organized as follows. In Section II, we describe the internal logic structure of an ADB and propose an ADB optimization flow in clock tree synthesis. We then propose, in Section III, an integrated algorithm for solving the ADB allocation, placement, and delay assignment problems together with a complete proof of linear-time optimality of the algorithm and handling the discrete delay values of ADBs. In Section IV, we discuss the characteristics of ADB implementation, propose a new ADB structure to cope with the output slew variation problem, and propose a tuning algorithm to address the ADB's base delay problem, followed by refinement techniques to reduce the total capacitors and transistors of ADBs.
We provide in Section V a set of diverse experimental data to validate the effectiveness of our proposed approach. Conclusions are given in Section VI.

II. ADB Structure and Synthesis Flow

A. ADB Logic Structure

The ADB is a buffer that can provide more than one delay, selected dynamically by control inputs. In other words, any design of special buffers that provides various delays would be acceptable as an ADB. Fig. 1 shows two widely used structures of an ADB. Fig. 1(a) shows an ADB implementation based on capacitors [17]. The ADB consists of two inverters, a capacitor bank, and a capacitor bank controller. When capacitors of a uniform size are used in the bank in Fig. 1(a), the delay of the ADB is linearly proportional to the number of capacitors activated in the bank. On the other hand, Fig. 1(b) shows an ADB implementation based on inverters [16], [18]. The ADB consists of parallel inverters and their SELECT pins. The SELECT signals provide several driving modes, by which the ADB delay is adjusted. For example, when all the SELECT signals are set to logic 0, the inverters connected to the corresponding SELECT pins are turned off. Thus, only the leftmost and the rightmost inverters in Fig. 1(b) are turned on and drive the output. The delay of the ADB can be adjusted by setting some of the SELECT signals to logic 1 to activate

their parallel inverters. The more SELECT signals are enabled, the shorter the delay of the ADB is. Note that as the maximum delay an ADB needs to generate increases, its inverter-based implementation requires increasing the number of parallel transistors, while its capacitor-based implementation requires increasing the number of capacitors. In addition, the precision of the delay values of an ADB is controlled by the size of the set of parallel transistors for the inverter-based implementation and by the size of the capacitor bank for the capacitor-based implementation. The capacitor-based implementation offers a finer granularity of delay values than the inverter-based implementation because delays are easier to fine-tune with capacitors than with inverters. However, the capacitor-based implementation is larger than the inverter-based implementation. For this reason, the inverter-based implementation is suited for the design domains with very high clock skew variation, while the capacitor-based implementation is acceptable for high-speed designs, in which a delicate delay adjustment is necessary. In this paper, we use the capacitor-based ADBs in Fig. 1(a).

B. Synthesis Flow

Fig. 2 shows our proposed synthesis flow that generates an ADB-based clock tree. We accept an input clock tree that is constructed by any of the traditional clock tree generation schemes. We first find an optimal ADB allocation and its delay value assignment under a given constraint of clock skew bound $B$ while using the given arrival time information for every power mode. Our ADB allocation presented in Section III-B guarantees to find a minimum number of ADBs under $B$. In the subsequent step in Section III-C, the optimal results are iteratively explored to support the discrete delay values of ADBs. The last two (yellow) steps in Fig. 2 solve the problem of the initial delay (i.e., base delay) of ADBs (in Section IV-B) and the problem of the output slope change (in Section IV-A) combined with a possible ADB size reduction (in Section IV-C).

Fig. 2. Proposed synthesis flow of clock tree optimization using ADBs.

III. ADB Allocation, Placement, and Delay Assignment

A. Key Observations

The ADB-based clock skew optimization problem is described as follows.

Problem 1 (ADB-Based Clock Skew Optimization Problem): Given an initial buffered clock tree $T$, power modes Mode-1, Mode-2, ..., Mode-$K$, clock signal arrival times on all power modes, and clock skew bound $B$, find the buffers in $T$ to be replaced by ADBs and assign delay values to the ADBs for every power mode such that the number of ADBs used is minimized while the clock skew constraint is satisfied for all power modes.

Before we describe the details of the proposed ADB allocation algorithm, we illustrate, using clock timing analysis, a number of key observations upon which our algorithm is built. Let us consider the clock tree rooted at a buffered node $n_1$ in Fig. 3(a), where there are five flip-flops (FFs) E, F, G, H, and I with arrival times in a power mode. We now look into whether the buffer $n_2$ should be replaced by an ADB, and if so, what delay it should be assigned. First, we introduce some definitions and notations to facilitate our discussion.

Definition 1 (Fully Reducible and Fully Simplified Timing Tree): A clock subtree $T$ with arrival time information to the FFs that $T$ has is called fully reducible if the ADB allocation and delay assignment for $T$ has been completed for every power mode. The fully reduced timing tree of a fully reducible subtree $T$ consists of the root of $T$ and two children of the root, which represent the FFs with the latest and earliest arrival times.
If there is more than one FF whose arrival time is the latest (or earliest), any of the FFs can be chosen. For example, if the subtree in Fig. 3(b) with two buffers $n_1$ and $n_2$ becomes a fully reducible subtree by performing the allocation of ADB only to $n_2$ with the ADB delay of 3 on a power mode, the fully reduced timing tree of $n_1$ on the power mode is shown on the right in Fig. 3(b).

1) $\alpha_i$ ($\ge 0$): the delay increment¹ to be assigned to an ADB in node $n_i$. $\alpha_i = 0$ means no ADB is needed at $n_i$.
2) $est_i$ and $lst_i$: the earliest and latest arrival times to the FFs whose clock signals pass through $n_i$. For example, in Fig. 3(a), $est_1 = 2$, $est_2 = 2$, $lst_1 = 14$, and $lst_2 = 10$.
3) $lst_{\max}$: the latest arrival time to FFs for a clock tree. That is, when $n_1$ is the root of the clock tree, $lst_{\max}$ is another notation equivalent to $lst_1$.
4) $est_{i\setminus\{k_1,\ldots,k_r\}}$ and $lst_{i\setminus\{k_1,\ldots,k_r\}}$: the earliest and latest arrival times to the FFs whose clock signals pass through $n_i$ but not through $n_{k_1}, \ldots,$ or $n_{k_r}$. If there is no such clock signal, $est_{i\setminus\{k_1,\ldots,k_r\}} = \infty$ and $lst_{i\setminus\{k_1,\ldots,k_r\}} = 0$. For example, in Fig. 3(a), $est_{1\setminus\{2\}} = 5$ and $lst_{1\setminus\{2\}} = 14$.
5) $slk_i$ ($\ge 0$): the clock skew of the clock tree rooted at $n_i$, i.e., the value of $lst_i - est_i$. $slk_{i\setminus\{k_1,\ldots,k_r\}}$ is defined to be $lst_{i\setminus\{k_1,\ldots,k_r\}} - est_{i\setminus\{k_1,\ldots,k_r\}}$. For example, in Fig. 3(a), $slk_1 = 12$ and $slk_2 = 8$. $slk_i$ is a fixed value independent of the delay values assigned to an ADB, if there is one, in node $n_i$.

¹ For conciseness, we also call it delay value when it does not cause any confusion.

The skew bound $B$ is assumed to be set to a value such that $B \ge slk_i$ for every node $n_i$, because otherwise no ADB allocation is able to resolve the clock skew violation.

Fig. 3. Relations between the ADB allocation and the arrival time distribution (the clock skew bound $B$ is set to 10). (a) Simple clock tree with two buffered nodes $n_1$ and $n_2$. (b) Case of arrival time distribution where $n_2$ should be replaced with an ADB. (c) Case where $n_2$ needs no ADB. (d) Case where ADB allocation never resolves the clock skew violation.

Let us examine the problem of determining if we need an ADB in node $n_2$ in the clock tree rooted at $n_1$, as shown in Fig. 3(a). With the clock skew bound $B = 10$, we make the following observations.

1) If an ADB is not allocated to $n_2$, then no matter what delay values are assigned to $n_1$, $lst_{\max} - est_2 > B$ always holds.
2) If an ADB is allocated to $n_2$, we can make $lst_{\max} - est_2 \le B$ by setting a delay increment (e.g., 3 or 4) to the ADB. Thus, we can conclude that an ADB should be assigned to $n_2$ to satisfy the skew constraint on the power mode. Once a delay value (e.g., $\alpha_2 = 3$) of the ADB is assigned, the resulting fully reduced (arrival) timing tree is shown on the right side in Fig. 3(b).
3) Let us consider other arrival times, as shown on the left clock tree in Fig. 3(c). Since $est_2$ is large enough to satisfy the skew constraint, an ADB is not needed at $n_2$, producing the fully simplified timing tree shown on the right in Fig. 3(c).
4) Last, there could be another arrival time relation in which the skew constraint can never be met even though an ADB is used in $n_2$. For example, in Fig. 3(d), since the delay value to an ADB is nonnegative, the value of $lst_{\max} - est_1$ ($= 12 - 1$) can never be reduced. For this case, $B$ should be reset to a number $\ge 11$.

We formalize observations 1–4 in the following three cases. Consider a clock tree with two buffered nodes $n_1$ and $n_2$ such that $n_1$ is the root of the tree and $n_2$ is a child node of $n_1$, such as in Fig. 3(a). Suppose all $est_{(\cdot)}$, $lst_{(\cdot)}$, and $slk_{(\cdot)}$ values of the clock tree before the ADB allocation to $n_2$ are available. Notation $(\cdot)$ indicates every node associated with the ADB allocation. For the example in discussion, it corresponds to nodes $n_1$ and $n_2$. Now, when an ADB is allocated to $n_2$ with delay value $\alpha_2$, the resulting clock skew can be expressed as follows:

$$\max\{slk_{1\setminus\{2\}},\ slk_2,\ lst_2 + \alpha_2 - est_{1\setminus\{2\}},\ lst_{1\setminus\{2\}} - (est_2 + \alpha_2)\}. \quad (1)$$

1) Case 1 ($lst_{\max} - est_{1\setminus\{2\}} \le B$ and $lst_{\max} - est_2 > B$): Since $lst_{\max} - est_2 > B$, it is necessary to set $\alpha_2$ to a positive value. We set

$$\alpha_2 = est_{1\setminus\{2\}} - est_2. \quad (2)$$

We can confirm $\alpha_2 > 0$ by using $lst_{\max} - est_2 > B$, $slk_2 \le B$, and $slk_{1\setminus\{2\}} \le B$.² The clock skew in (1) is then expressed as $\max\{slk_{1\setminus\{2\}},\ slk_2,\ lst_2 - est_2,\ lst_{1\setminus\{2\}} - est_{1\setminus\{2\}}\}$, which becomes $\max\{slk_{1\setminus\{2\}},\ slk_2,\ slk_2,\ slk_{1\setminus\{2\}}\} \le B$.
2) Case 2 ($lst_{\max} - est_{1\setminus\{2\}} \le B$ and $lst_{\max} - est_2 \le B$): With $\alpha_2 = 0$, implying no ADB allocation at $n_2$, the skew expression in (1) is simplified to $\max\{slk_{1\setminus\{2\}},\ slk_2,\ lst_2 - est_{1\setminus\{2\}},\ lst_{1\setminus\{2\}} - est_2\}$. If $lst_{\max} = lst_{1\setminus\{2\}}$, the clock skew becomes $\max\{slk_{1\setminus\{2\}},\ slk_2,\ lst_2 - est_{1\setminus\{2\}},\ lst_{\max} - est_2\} \le \max\{slk_{1\setminus\{2\}},\ slk_2,\ lst_{\max} - est_{1\setminus\{2\}},\ lst_{\max} - est_2\} \le B$. Otherwise, i.e., if $lst_{\max} = lst_2$, the clock skew becomes $\max\{slk_{1\setminus\{2\}},\ slk_2,\ lst_{\max} - est_{1\setminus\{2\}},\ lst_{1\setminus\{2\}} - est_2\} \le \max\{slk_{1\setminus\{2\}},\ slk_2,\ lst_{\max} - est_{1\setminus\{2\}},\ lst_{\max} - est_2\} \le B$. Thus, no ADB is needed at $n_2$.
3) Case 3 ($lst_{\max} - est_{1\setminus\{2\}} > B$): If $lst_{\max} = lst_{1\setminus\{2\}}$, then $lst_{\max} - est_{1\setminus\{2\}} = lst_{1\setminus\{2\}} - est_{1\setminus\{2\}} = slk_{1\setminus\{2\}} > B$, which violates the assumption that $B \ge slk_{(\cdot)}$ for every $(\cdot)$, whereas if $lst_{\max} = lst_2$, the term $lst_2 + \alpha_2 - est_{1\setminus\{2\}}$ in (1) becomes $lst_{\max} - est_{1\setminus\{2\}} + \alpha_2 > B$ since $\alpha_2 \ge 0$ and $lst_{\max} - est_{1\setminus\{2\}} > B$. This means the skew bound $B$ can never be met by ADB allocation.

² From $lst_{\max} = \max\{lst_2,\ lst_{1\setminus\{2\}}\}$, $lst_{\max} - est_2 > B$, and $lst_2 - est_2 \le B$, $lst_{\max}$ must be $lst_{1\setminus\{2\}}$. Then, from $lst_{\max} - est_2 > B$ and $lst_{1\setminus\{2\}} - est_{1\setminus\{2\}} \le B$, it is true that $est_{1\setminus\{2\}} > est_2$.

B. Proposed Algorithm

Our proposed algorithm, called CLK-ADB, for the ADB-based clock skew optimization is an iterative one, processing a clock tree $T$ in a bottom-up fashion. CLK-ADB sorts all the internal buffered nodes, excluding the leaf buffered nodes, in $T$ topologically and performs the following two steps iteratively for the clock subtrees rooted at the nodes in the sorted list.

Step 1) Allocating ADBs: Consider allocating ADBs to the nodes in the clock subtree rooted at node $n_i$. At this point, the ADB allocation to all descendant nodes

of $n_i$ other than the child nodes $n_{k_1}, \ldots, n_{k_r}$ of $n_i$ has already been processed during the previous iterations. Thus, all the subtrees rooted at the processed descendant nodes have been replaced with their fully simplified timing trees. Now, CLK-ADB determines for each child node $n_{k_j}$, $j = 1, \ldots, r$, whether an ADB is needed or not. The ADB allocation to $n_{k_j}$ is determined according to the three cases classified before, by replacing the notations $lst_{\max}$, $est_2$, $lst_2$, and $est_{1\setminus\{2\}}$ in the inequality relations in Case 1, Case 2, and Case 3 with $lst_i$, $est_{k_j}$, $lst_{k_j}$, and $est_{i\setminus\{k_1,\ldots,k_r\}}$, respectively.

Fig. 4. Illustration of the two-step ADB allocation of CLK-ADB for a clock signal timing tree rooted at $n_i$. The tree contains two child nodes $n_{k_1}$ and $n_{k_2}$. According to Cases 1, 2, and 3, $n_{k_1}$ belongs to Case 1, needing an ADB with a delay value of 4 to meet the clock skew constraint, while $n_{k_2}$ belongs to Case 2, not needing an ADB.

Step 2) Assigning delay values: Let $A_{case1}$ denote the subset of child nodes $n_{k_1}, \ldots, n_{k_r}$ that were determined in Step 1 to receive ADBs. The delay value $\alpha_{k_j}$ to be assigned to an ADB at each node $n_{k_j} \in A_{case1}$ is set according to the generalization of (2)

$$\alpha_{k_j} = est_{\min} - est_{k_j} \quad (3)$$

where $est_{\min} = est_{i\setminus A_{case1}}$. Note that the path of $est_{\min}$ always exists: otherwise $A_{case1} = \{n_{k_1}, \ldots, n_{k_r}\}$, so that $lst_{\max} = lst_{k_j}$ for some child node $n_{k_j}$, and the second inequality in Case 1 would then contradict $slk_{k_j} \le B$. Once the computed delay values are assigned, the corresponding subtrees are reduced to fully simplified timing trees. It should be noted that $\alpha_{k_j}$ in (3) is the smallest value that can be assigned as a delay value to the ADB in node $n_{k_j}$ to meet the clock skew constraint.
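As a numerical check of (2) and (3), the arrival times quoted for Fig. 3(a) can be substituted into the case conditions and into (1). The working below is an illustrative sketch that uses only those quoted values ($est_2 = 2$, $lst_2 = 10$, $est_{1\setminus\{2\}} = 5$, $lst_{1\setminus\{2\}} = lst_{\max} = 14$, $slk_2 = 8$) together with $B = 10$; it is not part of the original derivation.

```latex
\begin{align*}
lst_{\max} - est_{1\setminus\{2\}} &= 14 - 5 = 9 \le B, \qquad
lst_{\max} - est_{2} = 14 - 2 = 12 > B \quad \text{(Case 1)},\\
\alpha_{2} &= est_{1\setminus\{2\}} - est_{2} = 5 - 2 = 3,\\
\text{skew} &= \max\{slk_{1\setminus\{2\}},\ slk_{2},\
  lst_{2} + \alpha_{2} - est_{1\setminus\{2\}},\
  lst_{1\setminus\{2\}} - (est_{2} + \alpha_{2})\}\\
&= \max\{9,\ 8,\ 10 + 3 - 5,\ 14 - (2 + 3)\}
 = \max\{9,\ 8,\ 8,\ 9\} = 9 \le B.
\end{align*}
```

The increment of 3 agrees with the value $\alpha_2 = 3$ used in observation 2, and the resulting skew of 9 stays within the bound.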
The minimality of $\alpha_{k_j}$ in (3) directly implies that a minimal-sized capacitor should be allocated in the ADB in the course of ADB allocation. However, it does not always lead to a globally minimal allocation of capacitors of ADBs even though CLK-ADB allocates a minimal number of ADBs. In Section IV-C, we propose an ADB cost refinement technique to further reduce the total cost of capacitors of ADBs.

Fig. 4 shows an example of processing ADB allocation by CLK-ADB for a clock signal timing tree rooted at node $n_i$. The tree has two buffered nodes $n_{k_1}$ and $n_{k_2}$. In Step 1, all $est_{(\cdot)}$ and $lst_{(\cdot)}$ values are computed. Then, the three cases are checked for each of $n_{k_1}$ and $n_{k_2}$. It turns out that $n_{k_1}$ belongs to Case 1, requiring an ADB to satisfy the clock skew constraint, while $n_{k_2}$ belongs to Case 2, not needing an ADB. In Step 2, the delay value of the allocated ADB is computed by (3), where $est_{\min} = est_{i\setminus\{k_1\}}$ since $A_{case1} = \{n_{k_1}\}$, and the clock timing tree rooted at $n_i$ is fully simplified.

A complete example: Fig. 5(a)–(e) shows step-by-step results of ADB allocation and delay assignment by CLK-ADB for the clock subtrees rooted at node set $L$ = {Buf0, Buf1, Buf2, Buf5} with clock skew bound $B = 10$ in Fig. 5(a). We assume that the topologically sorted list of $L$ is (Buf5, Buf1, Buf2, Buf0). Then, the clock subtree rooted at Buf5 belongs to the form in Fig. 3(a), in which for Mode-1, $lst_{\max} = 15$, $est_{5\setminus\{7\}} = 6$, and $est_7 = 8$, which satisfy Case 2, while for Mode-2, $lst_{\max} = 14$, $est_{5\setminus\{7\}} = 5$, and $est_7 = 2$, which satisfy Case 1, allocating an ADB at Buf7 with a delay value of 3 ($= est_{\min} - est_7 = 5 - 2$) on Mode-2. Then, the clock tree with the fully simplified timing subtree rooted at Buf5 is shown in Fig. 5(b). Note that the timing numbers 6 and 5 (or 15 and 14) in G (or H) indicate the earliest (latest) arrival times of the clock signal to FFs passing through Buf5 for Mode-1 and Mode-2, respectively. Next, we look into the clock subtree rooted at Buf1 in Fig. 5(b).
Since there is no clock path on the subtree other than the paths that cover either Buf3 or Buf4, $est_{1\setminus\{3,4\}} = \infty$ by our definition, from which we can easily check that both Buf3 and Buf4 belong to Case 2 for Mode-1 and Mode-2. The resultant simplified timing tree is shown in Fig. 5(c). Similarly, the clock subtree rooted at Buf2 is then processed, and the resultant simplified clock tree is shown in Fig. 5(d). Finally, the clock subtree rooted at Buf0 is processed, by which it is shown that only Buf1 needs an ADB with a delay increment of 2 on Mode-1 and 3 on Mode-2. The whole clock tree with the ADB allocation and delay assignment is shown in Fig. 5(e), where the nodes with blue color represent ADBs. Note that an ADB is never allocated to the root node Buf0.

Fig. 6 depicts the flow of the proposed ADB allocation algorithm CLK-ADB. During the iterations that check the three cases in Step 1, if at least one tested clock subtree satisfies Case 3 on a power mode, we report that the input clock tree can never meet the clock skew constraint. Otherwise, if no clock subtree satisfies Case 3 on any power mode, but there is a clock subtree that satisfies Case 1 for a child node of the root of the subtree on a power mode, an ADB should be allocated at the child node with delay assignment.

Let $N$ and $K$ be the number of buffered nodes in an input clock tree and the number of power modes, respectively. Since each iteration of CLK-ADB takes a constant time for each power mode and there are at most $N$ iterations, the time complexity of CLK-ADB is bounded by $O(NK)$. Since $K$ is a small number, mostly not exceeding 6, CLK-ADB is a linear-time algorithm. The optimality of the ADB allocation by CLK-ADB is formally given by Lemma 1 and Theorem 1.

Lemma 1: The clock tree produced by CLK-ADB always contains a clock signal path of the earliest arrival time to FF such that no ADBs are allocated on that path.
Proof: We use induction on the height (h), in terms of the number of buffered nodes, of clock subtrees. For the case

Fig. 5. Example showing step-by-step results by CLK-ADB for processing clock subtrees. (a) Clock tree T before the bottom-up ADB allocation of CLK-ADB with clock skew bound B = 10. The topologically sorted node list is (Buf5, Buf1, Buf2, Buf0), and two power modes Mode-1 and Mode-2 are considered. (b) Processing clock subtree rooted at Buf1. (Case 1 holds on Mode-2 for Buf7. Thus, an ADB is allocated at Buf7 with a delay value of α2 = +3 on Mode-2.) (c) Processing clock subtree rooted at Buf2. (No ADBs are needed at Buf3 and Buf4.) (d) Processing clock subtree rooted at Buf0. (An ADB is needed at Buf1.) (e) Complete clock tree T after the ADB allocation of CLK-ADB where the allocated ADBs and delay assignment are shown with blue color.


of $h = 1$, i.e., the clock subtrees rooted at nodes with no child node, the claim is trivially satisfied since there is no buffered node except for the roots, and thus no ADB allocation. For the case of $h = l$ ($> 1$), consider a clock subtree rooted at a node $n_i$ such that $h = l$. We claim that the clock signal path corresponding to $est_{\min}$ is a path with: 1) the earliest arrival time to FF, and 2) no ADBs. Property 1 holds from the delay assignment in Step 2 of CLK-ADB that for every child node $n_{k_j} \in A_{case1}$, the increased value of $est_{k_j}$ by adding $\alpha_{k_j}$ in (3) never exceeds the value of $est_{\min}$, while property 2 holds by the induction hypothesis of the subtrees of height $l - 1$ or below.

Fig. 6. Flow of CLK-ADB for solving the ADB-based clock skew optimization problem. The green boxes represent the termination of the flow. The outer loop indicates the iteration of CLK-ADB, and the inner loop indicates the testing of each child node of the root of the subtree corresponding to the current iteration for ADB allocation with delay assignment.

Theorem 1: CLK-ADB uses a minimal number of ADBs to meet the clock skew constraint for all power modes.

Proof: We use induction on the height $h$ of the clock subtree in terms of the number of buffered nodes. For the case of $h = 1$, i.e., the clock subtrees rooted at nodes with no child node, the claim is trivially true since there is no ADB on the corresponding subtrees except for the roots. For the case of $h = l$ ($> 1$), consider a clock subtree rooted at a node $n_i$ such that $h = l$. Let $n_{k_1}, \ldots, n_{k_r}$ be the child nodes of $n_i$, and let $N_{opt}(T_x)$ and $N_{ADB}(T_x)$ denote an optimal number of ADBs and the number of ADBs used by CLK-ADB for the clock tree rooted at $n_x$, respectively.
Since $N_{ADB}(T_{k_j}) = N_{opt}(T_{k_j})$ for all $j = 1, \ldots, r$ by the induction hypothesis, and a clock skew violation, if it exists, caused by two clock arrival times in different clock subtrees rooted at child nodes requires one or more new ADBs, the optimal ADB relation between $T_i$ and $T_{k_j}$ can be expressed as

$$N_{opt}(T_i) \ge N_{opt}(T_{k_1}) + \cdots + N_{opt}(T_{k_r}). \quad (4)$$

More precisely, for each $n_{k_j} \in A_{case1}$, it is required to allocate at least one more ADB in the subtree rooted at $n_{k_j}$ to resolve the clock skew violation, since the clock path of $est_{k_j}$ to be increased contains no ADBs at all according to Lemma 1. Thus, the relation of ADB requirement can be expressed as follows:

$$N_{opt}(T_i) \ge N_{opt}(T_{k_1}) + \cdots + N_{opt}(T_{k_r}) + |A_{case1}|. \quad (5)$$

On the other hand, for each $n_{k_j} \in A_{case1}$, CLK-ADB allocates only one ADB at the root of the subtree, thus

$$N_{ADB}(T_i) = N_{opt}(T_{k_1}) + \cdots + N_{opt}(T_{k_r}) + |A_{case1}|. \quad (6)$$

By (5) and (6), $N_{ADB}(T_i) \le N_{opt}(T_i)$; since $N_{opt}(T_i)$ is the minimum over all feasible allocations, $N_{ADB}(T_i) = N_{opt}(T_i)$.
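The two steps of Section III-B can be sketched in code. The following Python sketch is a minimal illustration, not the authors' implementation: the `Buf` container, the function name `clk_adb`, and the latest FF arrival times 12 and 9 under Buf7 are assumptions made for the example, while the remaining arrival times reproduce the values quoted for the subtree rooted at Buf5 ($lst_{\max} = 15$, $est_{5\setminus\{7\}} = 6$, $est_7 = 8$ on Mode-1 and $14$, $5$, $2$ on Mode-2, with $B = 10$). Each node is reduced bottom-up to its earliest and latest arrival times per power mode, Case 3 is reported as an error, and Case-1 children receive the delay increment of (3).

```python
from dataclasses import dataclass, field
from math import inf
from typing import Dict, List, Tuple


@dataclass
class Buf:
    """Buffered clock-tree node (hypothetical container for this sketch)."""
    name: str
    children: List["Buf"] = field(default_factory=list)
    # ff[m]: arrival times in power mode m of the FFs driven directly by this
    # buffer, i.e., through none of its child buffers.
    ff: List[List[float]] = field(default_factory=list)


def clk_adb(root: Buf, bound: float, modes: int) -> Dict[str, List[float]]:
    """Return {buffer name: delay increment per mode} for the allocated ADBs."""
    adbs: Dict[str, List[float]] = {}

    def reduce_subtree(node: Buf) -> List[Tuple[float, float]]:
        # Process descendants first; each child yields (est, lst) per mode.
        kids = [(c, reduce_subtree(c)) for c in node.children]
        reduced = []
        for m in range(modes):
            own = node.ff[m] if node.ff else []
            est_rest = min(own, default=inf)  # no such path: est = infinity
            lst_rest = max(own, default=0.0)  # no such path: lst = 0
            lst_i = max([lst_rest] + [r[m][1] for _, r in kids])
            if lst_i - est_rest > bound:      # Case 3: ADBs cannot help
                raise ValueError(f"bound cannot be met at {node.name}, mode {m}")
            # Step 1: Case-1 children need an ADB; the others are Case 2.
            case1 = [(c, r) for c, r in kids if lst_i - r[m][0] > bound]
            case2 = [(c, r) for c, r in kids if lst_i - r[m][0] <= bound]
            # Step 2: est_min ranges over all paths avoiding Case-1 children.
            est_min = min([est_rest] + [r[m][0] for _, r in case2])
            lst_new = lst_i
            for c, r in case1:
                alpha = est_min - r[m][0]     # equation (3)
                adbs.setdefault(c.name, [0.0] * modes)[m] = alpha
                lst_new = max(lst_new, r[m][1] + alpha)
            # Fully simplified timing tree of this node in mode m.
            reduced.append((est_min, lst_new))
        return reduced

    reduce_subtree(root)
    return adbs


# Subtree rooted at Buf5 with child Buf7, two power modes, B = 10.
buf7 = Buf("Buf7", ff=[[8, 12], [2, 9]])
buf5 = Buf("Buf5", children=[buf7], ff=[[6, 15], [5, 14]])
result = clk_adb(buf5, bound=10, modes=2)
```

For this input the sketch allocates a single ADB at Buf7, with no increment on Mode-1 and an increment of 3 on Mode-2, in agreement with the Buf5 step of the complete example. Visiting every node once per power mode keeps the work proportional to $NK$.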

C. Consideration of Discrete Delay Values

The proposed ADB allocation algorithm and the existing algorithms [14]–[16] have initially assumed continuous delay values of ADBs. However, the size of each unit of the capacitor banks in Fig. 1(a) determines the delay precision of the ADB. CLK-ADB employs a simple roundoff scheme to handle the discrete delay values of ADBs. However, the roundoff scheme could cause a clock skew violation. We control the potential clock skew violation by intentionally tightening the clock skew bound $B$. That is, we repeatedly perform the following two-step procedure (called ADB-RD) until there is no clock skew violation: Step 1) apply CLK-ADB with clock skew bound $B$ and round off the delay values of ADBs, and Step 2) if there is a clock skew violation, reset $B = B - \Delta B$ and go to Step 1. The $\Delta B$ value is set to a number such that $R \cdot \Delta B = B$, where $R$ is the maximum number of iterations (Steps 1 and 2) the designer wants to set. We refer to the combined ADB allocation algorithm of CLK-ADB and ADB-RD as CLK-ADB-RD.

IV. Coping With Physical Limitations of ADBs

Since the characteristics of normal buffers and ADBs are inherently different, replacing some buffers with ADBs to optimize clock tree timing should address a number of issues in addition to the ADB allocation problem in Section III. Those are: 1) maintaining a consistent output slew of ADBs; 2) supporting the nonzero initial (i.e., base) delay of ADBs; and 3) configuring the size of capacitor banks in ADBs.

A. Maintaining a Consistent Output Slope

The delay of each ADB allocated in the clock tree varies according to the power modes in the system. The predefined delay value of the ADB for a power mode is established by controlling the number of capacitors to be activated in the ADB.
However, depending on the number of activated capacitors, the ADB's output slope varies, resulting in a very low output slew rate when a long ADB delay is established. We upgrade the conventional capacitor-based ADB circuit element by adding a new circuit consisting of six transistors to cope with the low output slew rate problem. The devised circuit is shown in Fig. 7(b). Both the input and output signals of the original ADB in Fig. 7(a) are used as inputs to the new circuit. The left four transistors of the newly added circuit generate an inverted signal only when the two input signals switch to the same logic state, while the right two transistors form a normal inverter, producing the correct output.

The new ADB circuit elements work as follows. Suppose the two inputs (i.e., the original and delayed inputs) and the output signal are all in logic state 0. When the original input signal switches from 0 to 1, the output signal still maintains state 0 because the left four transistors in the added circuit do not let the output be in state 1. Then, once the input signal has passed through the capacitor bank for the duration of the predefined delay value, both the original input and the delayed input signals are in state 1, which drives the output of the left four transistors to state 0, which in turn produces state 1 at the final output. Attaching the new circuit elements to the ADB thus prevents the low slew rate that a long delay causes at the output of the conventional ADB implementation. Note that a technique similar to the one in Fig. 7(b) is commonly used in the design field to maintain the output slew of design components other than ADBs.

Fig. 7. (a) Conventional capacitor-based ADB structure. (b) Proposed updated ADB structure, which maintains a consistent output slew rate that is immune to the ADB delay variation of the conventional structure.

B. Supporting Nonzero Base Delay

If an ADB allocated in the clock tree is not used in a power mode, that is, its delay increment is α = 0, the ADB is ideally assumed to behave exactly as the buffer that was previously placed at the same location. However, when all the capacitors in the capacitor bank of the ADB are turned off, the delay is actually larger than the buffer delay, i.e., the delay increment of the ADB is not zero. This problem is especially serious for large ADBs that are used to adjust delay over hundreds of picoseconds. Our experiments show that the nonzero base delay caused by the parasitic effect is not more than 5 ps for small ADBs that adjust delay up to 40 ps with 25.6 fF loading capacitance. However, ADBs that adjust delay up to 200 ps with the same loading capacitance induce about 40 ps of base delay even if all capacitors in the capacitor bank are turned off. Consequently, the substantially large base delay of ADBs could distort the clock skew calculation that we performed in the previous section. Rather than attempting to design an ADB that eliminates its parasitic effect completely, we propose a path-based greedy algorithm that can systematically refine the ADB allocation result. Our ADB refinement procedure is performed in the following four steps.

1) Compute the actual arrival times to FFs in the clock tree. If there is at least one power mode that violates the skew constraint, i.e., $slk_{root} > B$ due to the nonzero base delay of ADBs, go to step 2.
2) Let $lst_{slk}$ and $est_{slk}$ be the latest and earliest arrival times of the power mode in which $slk_{root}$ is the largest. If $slk_{root} \le B$, stop. Otherwise, go to step 3 to increase the delay on the clock path of $est_{slk}$.
3) Look into the ADBs on the path of $est_{slk}$ from the top to the bottom nodes of the clock tree. For each ADB, we want to increase its delay value for the power mode without considering resizing of ADBs.
   a) If there is no ADB from the current ADB to the bottom ADB that can increase delay without resizing, go to step 4.
   b) Let $D$ be the maximum delay by which the current ADB can be increased without resizing. Increase the ADB's delay value (by up to $D$) until: i) the clock skew violation of $slk_{root}$ is resolved, or ii) $lst_{slk}$ is increased (see footnote 3). For i), go to step 1. For ii), repeat step 3a) for the next ADB.
4) Look into the ADBs and normal buffers (each treated as an ADB with zero delay) on the path of $est_{slk}$ from the top to the bottom nodes. For each ADB, we want to increase its delay value for the power mode, considering resizing of ADBs.
   a) Increase the ADB's delay value by resizing the ADB until: i) the clock skew violation of $slk_{root}$ is resolved, or ii) $lst_{slk}$ is increased. For i), go to step 1. For ii), repeat step 4a) for the next ADB. Note that the repeated application of step 4a) is guaranteed to terminate eventually by executing i).

Steps 1 and 2 compute the actual clock delay and identify the clock signal path whose delay should be increased to resolve the worst clock skew violation. Step 3 then attempts to increase the ADBs' delay values without resizing the ADBs, thus causing no area overhead, while Step 4 tries to increase the ADBs' delay by resizing the ADBs.
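The no-resizing part of this procedure (Steps 1 to 3) can be sketched roughly as follows. This is a simplified illustration under stated assumptions rather than the authors' implementation: it handles a single power mode, models each sink's arrival time as a fixed base arrival plus the sum of the delay increments of the ADBs on its root-to-sink path, and leaves the resizing of Step 4 to the caller. All names (`refine_without_resizing`, `base`, `paths`, `max_delay`) are hypothetical.

```python
from typing import Dict, List, Tuple

EPS = 1e-9  # tolerance for floating-point comparisons


def refine_without_resizing(base: Dict[str, float],
                            paths: Dict[str, List[str]],
                            delay: Dict[str, float],
                            max_delay: Dict[str, float],
                            bound: float) -> Tuple[Dict[str, float], bool]:
    """Greedy delay increase on the earliest path, without resizing.

    base[s]      : arrival time at sink s with all ADB increments at zero
    paths[s]     : ADBs on the root-to-s path, ordered top to bottom
    delay[a]     : current delay increment of ADB a (for this power mode)
    max_delay[a] : largest increment ADB a supports without resizing
    Returns the updated increments and whether the skew bound is met;
    False means resizing (Step 4) would be needed.
    """
    delay = dict(delay)

    def arrivals() -> Dict[str, float]:
        return {s: base[s] + sum(delay[a] for a in paths[s]) for s in base}

    while True:
        arr = arrivals()                      # Step 1
        lst = max(arr.values())               # Step 2: latest arrival
        est_sink = min(arr, key=arr.get)      # sink with earliest arrival
        if lst - arr[est_sink] <= bound + EPS:
            return delay, True
        progressed = False
        for adb in paths[est_sink]:           # Step 3: top to bottom
            arr = arrivals()
            need = (lst - bound) - arr[est_sink]
            if need <= EPS:                   # i) violation on this path resolved
                break
            # ii) stop raising before any sink below this ADB overtakes lst
            below_max = max(arr[s] for s in base if adb in paths[s])
            step = min(max_delay[adb] - delay[adb], need, lst - below_max)
            if step > EPS:
                delay[adb] += step
                progressed = True
        if not progressed:
            return delay, False


# Hypothetical example: ADB X drives sinks f1 and f2, ADB Y drives f1 only.
base = {"f1": 100.0, "f2": 130.0, "f3": 150.0}
paths = {"f1": ["X", "Y"], "f2": ["X"], "f3": []}
new_delay, ok = refine_without_resizing(
    base, paths, {"X": 0.0, "Y": 0.0}, {"X": 30.0, "Y": 40.0}, bound=20.0)
print(new_delay, ok)
```

In the example, X can be raised only until f2 reaches the latest arrival time, after which the remaining deficit of f1 is covered by Y. Capping each increase at the point where the latest arrival would start to move mirrors condition ii) and footnote 3.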
Note that finding a polynomial-time algorithm for the nonzero base delay problem, or proving its NP-completeness, is another nontrivial task to be tackled, and it appears to be neither simple nor a straightforward extension of this work. Intuitively, the zero base delay assumption implies that allocating ADBs to a clock tree to fix the clock skew violation in one power mode does not affect the result of allocating ADBs in another power mode. With a nonzero base delay, however, the ADB allocation problem must be considered in all power modes together. Besides the algorithmic solution, the nonzero base delay problem may also be addressed by applying analog design techniques, for example, by properly upscaling the front and back inverters in the ADB.

Footnote 3: As the delay value of an ADB is increased in Step 3b) or Step 4a), the clock path corresponding to $lst_{slk}$ may change due to the increase of its delay. In this case, the increase of the delay value stops at the point where $lst_{slk}$ starts to increase.

C. Discussion on Further Reducing ADB Cost

Even though CLK-ADB uses a minimal number of ADBs, the total implementation cost of ADBs may not be minimal, because the area of an ADB is proportional to the largest delay value that the ADB should establish. Furthermore, the proposed new ADB structure increases the implementation cost. We apply the following two simple methodologies, called Method A and Method B, to explore the possibility of further reducing the total cost of ADB implementations. We apply Method A first to reduce the total cost of capacitor banks in ADBs, and then apply Method B to reduce the total cost of transistors in ADBs by selectively using our newly designed ADBs.

1) Method A: Maximally Trading Large Capacitors With Small Ones:
A.1) Pick the ADB $A_i$ with the largest size. Then, find the ADBs that are in the timing dependence relation with $A_i$, and sort these ADBs from the smallest size to the largest.
A.2) Let $C$ be the size of the unit capacitor bank. Check if reducing the capacitor size of $A_i$ by $C$ causes a clock skew violation. If it does, go to step A.3. Otherwise, reduce the capacitor size of $A_i$ by $C$ and repeat step A.2.
A.3) Check if reducing the capacitor size of $A_i$ by $2C$ and increasing the capacitor size of some ADB in the sorted list by $C$ causes a clock skew violation. If it does, exclude $A_i$ from further size reduction and go to step A.1. Otherwise, reduce the capacitor size of $A_i$ by $2C$, increase the capacitor size of the selected ADB by $C$, and repeat step A.3.

2) Method B: Minimally Replacing ADBs With Newly Designed ADBs:
B.1) Collect all the ADBs allocated by CLK-ADB-RD that violate the slew constraint. If no ADBs are collected, stop. Otherwise, pick the ADB $A_i$ with the lowest slew rate.
B.2) Replace $A_i$ with the newly designed ADB in Fig. 7(b). This may also cause some of the other collected ADBs to satisfy the slew constraint. Go to step B.1.

V. Experimental Results

A. Experimental Setup

We implemented our proposed algorithm CLK-ADB and the algorithm in [14] in C++ on a system with a … GHz Intel Xeon CPU and 8 GB of memory. All input clock trees were generated using Synopsys IC Compiler with the 45-nm Nangate Open Cell Library. As clock tree synthesis parameters, the maximum load capacitance was set to 51.2 fF, and buffer sizing was disabled. The buffers in the resulting clock trees had 18.8 fan-outs on average. The wire segments between buffering elements had a lumped resistance of 70 Ω and a lumped capacitance of 8.4 fF on average. We tested eight benchmark circuits: three ISCAS'95 benchmarks, three ITC'99 benchmarks, and two ISPD'09 benchmarks. We partitioned each benchmark into six to ten power subdomains, each of which can operate in two
different voltage levels, 0.95 V and 1.1 V. We assumed four different power modes and measured the worst clock skew over all power modes. We also assumed that each ADB can be adjusted with a granularity of 10 ps. The experiments associated with the real implementation of the ADB were performed using HSpice.

TABLE I. Comparison of Results Produced by the Previous ADB Allocation [14] and Our CLK-ADB Using a Simple Roundoff Scheme and Our ADB-RD.
Columns: Circuit; #FFs; #Bufs; Original Skew (ps); Original Latency (ps); Skew Bound B (ps); Using Simple Roundoff Scheme: [14] (#ADBs, Skew (ps)) and CLK-ADB (#ADBs, Skew (ps)); Combining Our ADB-RD: [14] + ADB-RD (#ADBs/Area, Skew (ps), Ctrl WL (μm)) and CLK-ADB-RD (#ADBs/Area, Skew (ps), Ctrl WL (μm)).
Area indicates the total size of ADBs in μm². On average, 9%, 11%, and 10% reductions of the total number of ADBs, the total ADB area, and the control wire length are achieved, respectively, by CLK-ADB.

Fig. 8. Numbers of ADBs used by CLK-ADB with ADB-RD as the clock skew constraint is relaxed.

B. Assessing the Performance of CLK-ADB and ADB-RD

Table I summarizes the results produced by the previous ADB allocation algorithm in [14] and our CLK-ADB in Section III-B, either using a simple roundoff scheme applied to each ADB independently or integrating our roundoff scheme ADB-RD in Section III-C to support the discrete delay values of ADBs. The first column shows the tested benchmark circuits: ISCAS'95 benchmarks s35932, s38417, and s38584; ITC'99 benchmarks b17, b18, and b22; and ISPD'09 benchmarks f31 and f34. The second, third, fourth, fifth, and sixth columns show the number of FFs, the number of buffers in each tested clock tree, the worst clock skew, the maximum latency extracted from IC Compiler, and the skew bound $B$, respectively.
The seventh and eighth columns show the results produced by the ADB allocation algorithm in [14] and by CLK-ADB, each followed by a simple roundoff applied to each ADB. Notice that the number of ADBs used by CLK-ADB is minimal. However, both algorithms fail to meet the clock skew bound when the simple roundoff scheme is applied. On the other hand, the last six columns show the results produced by the ADB allocation algorithm in [14] and by CLK-ADB when combined with our ADB-RD to meet the clock skew constraint. The comparison shows that CLK-ADB with ADB-RD consistently uses fewer ADBs and a shorter total wirelength to control the ADBs than [14] with ADB-RD, while satisfying the clock skew constraint. The wirelength is estimated by finding a minimum spanning tree on a graph whose nodes are the ADBs and whose edge weights are the half-perimeter wirelengths between ADBs. Precisely, the ADB reduction is about 10% on average for clock skew bounds of 30–50 ps, with even shorter clock latencies. For example, for s35932, [14] requires 60 ADBs to meet a skew bound of 30 ps under all power modes, whereas CLK-ADB uses 55 ADBs to meet the same constraint. It should be noted that we could not compare the performance of the ADB allocation algorithm in [16] with ours, because it assumes a different ADB structure from ours and does not clearly describe the discrete characteristics of the ADB; it does not even specify a scheme or algorithm to support discrete ADB delay values. Fig. 8 shows the numbers of ADBs used by CLK-ADB combined with ADB-RD as the clock skew constraint changes for s35932, s38417, and s…. We can see that there is a good tradeoff between the skew bound and the number of ADBs, which implies that CLK-ADB-RD can be used to find
alternative solutions in terms of skew bound and the number of ADBs. Fig. 9 shows the allocation and placement by CLK-ADB-RD for b17 under $B = 50$ ps. The dots represent the buffer locations, among which the blue dots represent the ADBs allocated by CLK-ADB-RD; a total of 13 ADBs are allocated, whereas 15 ADBs are allocated by [14].

Fig. 9. ADB allocation and placement result by CLK-ADB-RD for design b17 at $B = 50$ ps in Table I. Large blue and small red dots indicate ADBs and buffers, respectively. CLK-ADB-RD allocates 13 ADBs, whereas the algorithm in [14] allocates 15 ADBs.

C. Validation of the Practical Characteristics of ADBs

1) Nonuniform Discrete Delay Variation: To extract the real timing characteristics of ADBs, we implemented and simulated ADBs using HSpice. We initially implemented a conventional ADB that consists of two inverters, a capacitor bank, and its controller. Table II summarizes the timing results of an ADB composed of a five-stage capacitor bank. We implemented each switch and capacitor in the capacitor bank with minimum-size nMOS transistors. The first column indicates the voltage applied to the ADB, and the second column indicates the input transition time to the ADB. The third column shows the four different load capacitances that we used. The remaining pairs of columns show the delays for the rising edge and the falling edge when we turn on none (0), one (1), two (2), three (3), four (4), and all five (5) switches in the capacitor bank.

TABLE II. Transition Delay Table of ADB With Five-Stage Capacitor Bank.
Columns: Voltage (V); Input Transition (ps); Load Cap. (fF); Rise Delay (ps) and Fall Delay (ps) for 0, 1, 2, 3, 4, and 5 activated unit capacitors.
The results indicate that the discrete delay value of the ADB varies (i.e., its precision is not constant) according to the input transition time, the load capacitance, and the operating voltage. As an example of an extreme case, the delay of the ADB changes from … ps to … ps with a (small) granularity of 1.92 ps when the input transition time is 37.5 ps and the load capacitance is 0.4 fF under a 1.1 V operating voltage, while it changes from 1216 ps to 1242 ps with a (large) granularity of 5.2 ps when the input transition time is 150 ps and the load capacitance is … fF under a 0.95 V operating voltage. This nonuniform delay granularity of the ADB does not obey the assumption used in [1], [14], [15], in which a uniform granularity of 10 ps is applied. Note that our CLK-ADB-RD has no restriction on the delay granularity. We implemented various kinds of ADBs with 5–50 internal unit capacitors and switches, and created the ADB delay table through HSpice simulation. Then, we estimated the delays of all possible configurations by interpolation based on the table, and used these delays to check the effectiveness of our practical extensions of CLK-ADB-RD.

2) Nonuniform Output Slope Variation: Table III shows how the output slope changes when we select ADBs of different sizes and activate different numbers of capacitors. In the experiment, we set the input transition time to 150 ps, the load capacitance to 25.6 fF, and the operating voltage to 0.95 V. Then, we changed the number of activated capacitors in the circuit and observed the changes of the output slew of the circuit, where the slews are measured at …. We implemented the internal capacitor bank with minimum-size nMOS transistors. Then, we controlled the number of capacitors and switches to change the size of the ADB and its adjustable delay range. The first column in Table III indicates the number of unit capacitors in the internal capacitor bank of the ADB. The next five pairs of columns show the variation of output slews for the rising
edge and the falling edge when the rates of activated unit capacitors are 0%, 20%, 40%, 80%, and 100%. The last two columns show the relative difference between the cases when all switches are turned off and when all switches are turned on, for the rising and falling edges.

TABLE III. Output Slew Variation of Traditional ADB Structure.
Columns: Size of ADB; Slew (Rising) (ps) and Slew (Falling) (ps) for rates of activated capacitors of 0%, 20%, 40%, 80%, and 100%; Variation (%) for the rising and falling edges.
Supply voltage of 0.95 V, input transition of 150 ps, and load capacitance of 6.4 fF are applied to the ADB in Fig. 7(a). As the slew varies depending on load capacitance, ADB inverter sizing, and ADB capacitor bank sizing, the designer should control such parameters to balance the rising and falling slews.

The results reveal that the output slew varies depending on the number of activated capacitors in the capacitor bank. For the ADB with 50 unit capacitors, the output slew is … ps for the rising edge and 185 ps for the falling edge when all switches connected to capacitors are turned off. Moreover, the output slew increases as the number of activated capacitors increases. For example, the output slew of the ADB with 50 unit capacitors rises up to … ps for the rising edge and … ps for the falling edge. As shown in the table, for ADBs with ten or fewer unit capacitors, the variation of output slew is within 10% for both the rising and falling edges, which we accept as tolerable in our experiment. We call the ADBs with tolerable variation small-sized ADBs; they will be used in the ADB cost reduction procedure (Method B in Section IV-C). Table IV shows the output slew variation obtained by simulating the newly designed ADBs in Fig. 7(b). It is verified that all the output slews are small and uniform.

3) Nonzero Base Delay: Another characteristic of the ADB implementation is the nonzero base delay increment due to the parasitic effect. This invalidates the assumption that the ADB delay can be set to 0 when the ADB is not used in a power mode. We implemented the new ADB structure in Fig. 7(b) and compared its base delay with that of a normal buffer. Fig. 10 shows how the base delay of an ADB changes as the range of its capacitor bank changes, assuming a load capacitance of 25.6 fF. The base delay increases proportionally with the size of the ADB, which determines the parasitic resistance and capacitance. This nonzero base delay may cause a clock skew violation. For example, an ADB implementation whose delay should increase up to … ps under 25.6 fF load capacitance will inherently entail a 60.2 ps base delay. Then, consider the case when the ADB delay of … ps adjusted under power mode 1 needs to be readjusted to 0 ps under power mode 2, because the ADB will not be used in power mode 2. One solution might be to install the ADB and one buffer in parallel, with a transmission gate to select one of them in each power mode. Instead, in this paper, we have proposed a greedy ADB refinement algorithm in Section IV-B.

Fig. 10. Change in base delay of ADB as the range of capacitor bank in ADB changes under load capacitance = 25.6 fF.

D. Assessing the Performance of Practical Extensions of CLK-ADB-RD

1) CLK-ADB-RD Supporting Base Delay: Table V summarizes the results produced by CLK-ADB-RD without and with the use of our refinement algorithm in Section IV-B (denoted ref-alg in the table) to handle the practical issue of nonzero base delay.
The comparison shows that, without considering the nonzero base delay, the resultant clock skew greatly exceeds the clock skew bound $B$ for all test cases, while our proposed ADB refinement algorithm meets the clock skew constraint for all test cases under the nonzero base delay, at an area penalty of 18% on average.

2) CLK-ADB-RD Supporting Slew Variation and Refining ADB Cost: Fig. 11(a) and (b) show the distribution of ADB sizes (in terms of the number of unit capacitors) produced by the ADB cost refinement technique Method A proposed in Section IV-C for test case s35932 under 30 ps and 40 ps clock skew constraints, respectively. The distributions show that the majority of the ADBs (82%–86%) have no more than ten unit capacitors. Fig. 12 shows the (normalized) ADB cost optimized by applying Method A followed by Method B to the ADB allocation results produced by CLK-ADB-RD, in which the
initial ADB cost is measured by the total area of ADBs allocated by CLK-ADB-RD using the newly designed ADBs exclusively for output slew conservation. However, the greedy redistribution of capacitors among ADBs by Method A enables Method B to minimally use the newly designed ADBs to meet the output slew constraint. The overall area reduction is 29.5% on average.

TABLE IV. Output Slew Variation of a New ADB Structure.
Columns: Size of ADB; Slew (Rising) (ps) and Slew (Falling) (ps) for rates of activated capacitors of 0%, 20%, 40%, 80%, and 100%; Variation (%) for the rising and falling edges.
Supply voltage of 0.95 V, input transition of 150 ps, and load capacitance of 6.4 fF are applied to the ADB in Fig. 7(b). As the slews vary depending on load capacitance, ADB inverter sizing, and ADB capacitor bank sizing, the designer should control such parameters to balance the rising and falling slews.

TABLE V. Comparison of Results Produced by CLK-ADB-RD Assuming Base Delay = 0 and CLK-ADB-RD Combining the Refinement Algorithm in Section IV-B to Support the Nonzero Base Delay.
Columns: Circuit; #FF; #buf; Original Skew (ps); Skew Bound B (ps); CLK-ADB-RD Skew (ps); CLK-ADB-RD + ref-alg Skew (ps); Increased Area (%).

Fig. 11. ADB distribution in terms of the number of unit capacitors for s35932 (a) under 30 ps and (b) under 40 ps skew constraints, respectively.

Fig. 12. (Normalized) total area of ADBs after the application of cost reduction technique Method A followed by the selective ADB replacement technique Method B.

VI. Conclusion

This paper proposed a complete solution to the problem of clock skew minimization using ADBs under multiple power modes. We proposed a linear-time algorithm that simultaneously solved the problems of computing the minimum (optimal) number of ADBs to be used, the location at which each ADB is to be placed, and the delay value of each
ADB to be assigned to each power mode. To be practically feasible, we also proposed a new ADB design technique and systematic algorithmic solutions to address the problems of discrete delay values, slew rate variation, nonzero base delay of ADBs, and a possible exploration of ADB resizing, which have not been completely addressed by previous works. Through extensive experiments, it was confirmed that our proposed ADB allocation flow was able to provide a practically useful solution to the clock skew optimization problem for designs with multiple power modes that support diverse platforms or applications [19], [20].

References

[1] K.-H. Lim and T. Kim, "An optimal algorithm for allocation, placement, and delay assignment of adjustable delay buffers for clock skew minimization in multi-voltage mode designs," in Proc. IEEE Asia South-Pacific Design Autom. Conf., Jan. 2011.
[2] C. J. Alpert, A. Devgan, and S. T. Quay, "Buffer insertion with accurate gate and interconnect delay computation," in Proc. ACM/IEEE Design Autom. Conf., Jun. 1999.
[3] J. Cong, C. Koh, and K. Leung, "Simultaneous buffer and wire sizing for performance and power optimization," in Proc. IEEE/ACM Int. Symp. Low Power Electron. Design, Aug. 1996.
[4] C. C. N. Chu and M. D. F. Wong, "An efficient and optimal algorithm for simultaneous buffer and wire sizing," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 18, no. 9, Sep.
[5] I.-M. Liu, T.-L. Chou, A. Aziz, and M. D. F. Wong, "Zero-skew clock tree construction by simultaneous routing, wire sizing and buffer insertion," in Proc. ACM Int. Symp. Phys. Design, 2000.
[6] T. Okamoto and J. Cong, "Buffered Steiner tree construction with wire sizing for interconnect layout optimization," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 1996.
[7] J.-L. Tsai, T.-H. Chen, and C.-P.
Chen, "Zero skew clock-tree optimization with buffer insertion/sizing and wire sizing," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 23, no. 4, Apr.
[8] K. Wang, Y. Ran, H. Jiang, and M. Marek-Sadowska, "General skew constrained clock network sizing based on sequential linear programming," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 5, May.
[9] S. Hu and J. Hu, "Unified adaptivity optimization of clock and logic signals," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 2007.
[10] V. Khandelwal and A. Srivastava, "Variability-driven formulation for simultaneous gate sizing and post-silicon tunability allocation," in Proc. ACM Int. Symp. Phys. Design, 2007.
[11] J.-L. Tsai and L. Zhang, "Statistical timing analysis driven post-silicon-tunable clock-tree synthesis," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 2005.
[12] E. Takahashi, Y. Kasai, M. Murakawa, and T. Higuchi, "A post-silicon clock timing adjustment using genetic algorithms," in Proc. Symp. Very Large Scale Integr. Circuits, Jun. 2003.
[13] S. Tam, S. Rusu, U. Nagarji Desai, R. Kim, J. Zhang, and I. Young, "Clock generation and distribution for the first IA-64 microprocessor," IEEE J. Solid-State Circuits, vol. 35, no. 11, Nov.
[14] Y.-S. Su, W.-K. Hon, C.-C. Yang, S.-C. Chang, and Y.-J. Chang, "Value assignment of adjustable delay buffers for clock skew minimization in multi-voltage mode designs," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 2009.
[15] Y.-S. Su, W.-K. Hon, C.-C. Yang, S.-C. Chang, and Y.-J. Chang, "Clock skew minimization in multi-voltage mode designs using adjustable delay buffers," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 29, no. 12, Dec.
[16] K.-Y. Lin, H.-T. Lin, and T.-Y. Ho, "An efficient algorithm of adjustable delay buffer insertion for clock skew minimization in multiple dynamic supply voltage designs," in Proc. IEEE Asia South-Pacific Design Autom. Conf., Jan. 2011.
[17] A.
Kapoor, N. Jayakumar, and S. P. Khatri, "A novel clock distribution and dynamic de-skewing methodology," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 2004.
[18] G. N. Roberts, "Adjustable buffer driver," U.S. Patent.
[19] E. H. Nam, K. S. Choi, J.-Y. Choi, H. J. Min, and S. L. Min, "Hardware platforms for Flash memory/NVRAM software development," J. Comput. Sci. Eng., vol. 3, no. 3, Sep.
[20] K. Deray and S. J. Simoff, "Designing technology for visualisation of interactions on mobile devices," J. Comput. Sci. Eng., vol. 3, no. 4, Dec.

Kyoung-Hwan Lim received the B.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology, Daejeon, Korea, in 2004, and the M.S. and Ph.D. degrees in electrical and computer engineering from Seoul National University, Seoul, Korea, in 2007 and 2012, respectively. He is currently with the System LSI Division, Samsung Electronics Company, Ltd., Yongin, Korea. His current research interests include clock tree optimization, high-level synthesis, and timing closure.

Deokjin Joo (S'11) received the B.S. and M.S. degrees in electrical engineering from Seoul National University, Seoul, Korea, in 2009 and 2011, respectively. He is currently pursuing the Ph.D. degree with the School of Electrical Engineering and Computer Science, Seoul National University. His current research interests include clock tree synthesis for low-power and thermal-resilient design.

Taewhan Kim (SM'08) received the B.S. degree in computer science and statistics and the M.S. degree in computer science from Seoul National University, Seoul, Korea, and the Ph.D. degree in computer science from the University of Illinois at Urbana-Champaign, Urbana. He is currently a Professor with the School of Electrical Engineering and Computer Science, Seoul National University.
After graduation, he was with Lattice Semiconductor Corporation and Synopsys, Inc., San Jose, CA, for six years, specializing in design automation tool development. He has published over 160 technical papers in international journals and conferences, including the IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, the IEEE Transactions on Very Large Scale Integration (VLSI) Systems, ACM TODAES, DAC, ICCAD, and ASPDAC. His current research interests include computer-aided design of integrated circuits, ranging from architectural synthesis to physical design, with a specific focus on power, thermal, noise, reliability, and 3-D integrated circuit design issues. Dr. Kim is the Editor-in-Chief of the International Journal of Computing Science and Engineering.
