Representative Path Selection for Post-Silicon Timing Prediction Under Variability

Lin Xie and Azadeh Davoodi
Department of Electrical & Computer Engineering
University of Wisconsin - Madison
{lxie2, adavoodi}@wisc.edu

ABSTRACT

The identification of speedpaths is required for post-silicon (PS) timing validation, and it is becoming increasingly time-consuming due to manufacturing variations. In this paper we propose a method to find a small set of representative paths that can help monitor a large pool of target paths which are more prone to fail timing at the PS stage, thereby reducing the validation effort. We first introduce the concept of effective rank to select a small set of representative paths that predict the target paths with high accuracy. To handle the large dimension and degree of independent random parameter variations, we then allow modeling target path delays using segment delays and formulate the segment selection as a convex problem. The identified segments can be incorporated in the design of custom test structures to monitor PS circuit timing behavior. Simulations show that we can use the actual timing information of fewer than 100 paths or segments to accurately predict up to 3,500 target paths (statistically critical ones) with more than 1,000 process variables.

Categories and Subject Descriptors
B.7.2 [Integrated Circuits]: Design Aids

General Terms
Algorithms, Design

Keywords
Post-Silicon Validation, Process Variations

1. INTRODUCTION

In the presence of deep submicron electrical issues and process variations, post-silicon timing validation is becoming significantly more expensive and time-consuming [6]. Among the related literature, [1] proposes a statistical learning approach to predict timing failures that might occur on target speedpaths. This prediction relies on measuring the delays of a small set of representative paths. However, [1] does not discuss the selection of these representative paths.
This research is supported by the National Science Foundation under award CCF. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC 2010, June 13-18, 2010, Anaheim, California, USA. Copyright 2010 ACM.

To help identify representative paths, [3] proposes a technique which relies on defining a set of basic features (e.g., types of logic gates) to rank and cluster the target speedpaths. This helps to define a smaller subset of representative paths to be used for timing failure prediction. However, it is not clear to what extent these features can bind the paths to their representatives in the presence of variations. Another related work is [7], which synthesizes a representative speedpath so that its delay highly correlates with the circuit delay. By directly measuring the delay of this representative path at post-silicon, the chip frequency can be predicted. However, this approach cannot localize the timing failure.

In this paper, we propose to monitor a large pool of target paths, which are more likely to fail timing at the post-silicon stage. We study the identification of a set of representative paths from this set of target paths at the design stage, such that special-purpose test structures or flip-flops can be embedded in the circuit to allow their post-silicon measurement. These measured delays are then used to predict the post-silicon delays of the large pool of target speedpaths. The goal is to select a minimum number of representative paths such that their delays highly correlate with the delays of the target speedpaths. This helps to better localize those speedpaths that may fail their timing requirements.
We assume the source of delay uncertainty at the post-silicon stage is parameter variations (not other electrical issues). The challenge is that the actual values of the parameter variations are unknown. To further reduce the post-silicon validation effort, we allow selection of path segments. Segment selection can be useful in the presence of a large number of independent random variations to reduce the overall number of post-silicon measurements. Even if segment delays may not be directly measurable, their identification can be incorporated in the design of custom test structures to predict the post-silicon behavior.

Our contributions in this paper are enumerated below:

1. Given a high degree of independent random variations and a large number of target paths, we discuss a technique which can decrease the number of representative paths. This is based on the idea of the effective rank of the transformation matrix between the process parameter variations and the target path delays.

2. We discuss a convex formulation to select a minimum number of segments to predict the target paths.

Simulations show that for over 1,000 random variations and up to several thousand target paths, at most 100 paths or segments are needed for delay prediction. Even though a prediction error remains, we show that the resulting guard-band for post-silicon timing analysis is very small, and therefore our framework can be used to accelerate timing validation.

Figure 1: Only three of the designated paths (merging at G5) are sufficient to predict the delay of the fourth one with zero error. (Left: circuit with gates G1 to G9. Right: its graph-based representation.)

2. PRELIMINARIES AND MOTIVATION

Preliminaries: Given a set of target paths $P_{tar} = \{p_1, p_2, \ldots, p_n\}$, we aim to derive their delays at the post-silicon stage, which are denoted by $d_{P_{tar}} = [d_{p_1}, \ldots, d_{p_n}]$. Let us define process variations by $x = [x_1, \ldots, x_m]$, and assume that the actual values of $x$ for a fabricated chip are unavailable. Each entry $x_i$ falls into one of three variation categories: die-to-die, within-die, and random. A die-to-die variation is common to all the paths. A within-die variation is shared among a group of interconnects or gates that are in the same physical proximity on the chip. Note that we can decorrelate the within-die variations using existing approaches such as the hierarchical spatial correlation model in [2]. A random variation is specific to a gate or interconnect that belongs to at least one of the $n$ target paths. Therefore, the number of random components can be very large in general. All entries $x_i$ are independent of each other and are typically Gaussian distributed with mean 0 and variance 1.

We model the path delays as a linear function of the parameter variations, similar to [2]:

$$d_{P_{tar}} = \mu_{P_{tar}} + A x, \quad (1)$$

where $\mu_{P_{tar}}$ is a vector representing the nominal delays of all paths in the set $P_{tar}$ and the matrix $A$ captures the linear transformation of process parameter variables to the delays of the target paths. For the i-th path delay, we have $d_{p_i} = \mu_{p_i} + \sum_{j=1}^{m} a_{ij} x_j$, where $\mu_{p_i}$ is the nominal path delay and $a_{ij} = 0$ if the j-th parameter variation is not related to any gate or interconnect on path $i$. Otherwise, $a_{ij}$ is the sensitivity of $d_{p_i}$ with respect to $x_j$.

Next, we introduce the notation for the delay of a segment of a path, as it will be needed in later sections.
Given the graph representation of the set of target paths, a segment is the union of those consecutive edges in the paths that do not have any incoming or outgoing edges in between. For a vector of $n_S$ segments $S = \{s_1, s_2, \ldots, s_{n_S}\}$, we express its delay $d_S$ and Eqn (1) as:

$$d_S = \mu_S + \Sigma x, \qquad d_{P_{tar}} = G d_S = G\mu_S + G\Sigma x, \quad (2)$$

which indicates $A = G\Sigma$ and $\mu_{P_{tar}} = G\mu_S$.

Motivation: Given a set of target paths $P_{tar}$, we aim to select a minimum-sized subset of paths from $P_{tar}$ to form a set of representative paths, which we denote by $P_r = \{p_1, p_2, \ldots, p_r\}$. The delays of these representative paths will be measured at the post-silicon stage and used to predict the delays of the remaining paths in $P_{tar}$, reducing the post-silicon validation effort. Take Figure 1 as an example. Here, we consider four paths: $p_1$: G1 → G3 → G5 → G7 → G9, $p_2$: G1 → G3 → G5 → G6 → G8, $p_3$: G2 → G4 → G5 → G6 → G8, and $p_4$: G2 → G4 → G5 → G7 → G9. The graph-based representation of the subcircuit containing these four paths is given on the right of Figure 1. We can write an exact expression for the delay of each path as a linear combination of the remaining three paths, due to the common segments shared among these paths. For example, $d_{p_1} = d_{p_2} - d_{p_3} + d_{p_4}$. It indicates that we can form a set of representative paths $P_r = \{p_2, p_3, p_4\}$ to model the delays of the remaining paths with zero error. However, the size of $P_r$ can still be very large (3 out of 4). As we will show in this paper, by allowing a small prediction error tolerance, it is possible to significantly reduce the number of representative paths. This number can be reduced further if accurate path-segment delay information is available at the post-silicon stage.

3. PROBLEM DEFINITION

Problem Definition: Given a set of target paths denoted by $P_{tar}$ of size $n$, we aim to select a set of representative paths $P_r \subseteq P_{tar}$ of size $r$. Here, we denote the set of the remaining $n - r$ paths in $P_{tar}$ as $P_m$ and their delays as $d_{P_m}$.
Assuming the actual delays of the representative paths, denoted by $d_{P_r}$, are available at the post-silicon stage, we would like to build a prediction model that maps $d_{P_r}$ to $d_{P_m}$, the actual delays of the paths in $P_m$ at the post-silicon stage. Our objective is to minimize $r$ and identify the representative paths $P_r$ such that the error of the delay prediction model is upper bounded by a sufficiently small, user-provided tolerance $\epsilon$. In this way, we expect to correctly predict the timing of the paths most likely to fail, which helps towards more effective and faster debug and diagnosis.

To solve this problem, in Section 4 we show that a recently introduced idea known as the effective rank [4] of the transformation matrix between parameter variations and target path delays ($A$ in Eqn (1)) can be applied. While the rank of $A$ identifies representative paths which exactly predict the delays of the target paths, the effective rank of $A$ is calculated for a given error tolerance and can reduce the number of representative paths. To further reduce the number of selections, in Section 5 we allow selecting path segments to predict the delays of the target paths. We show a significant reduction in the total number of selections even when the dimension of the independent random variations, $x$, is very high. Other related research acknowledges that dealing with a large number of parameter variations remains a challenge even for the simpler problem of chip delay prediction [7].

Assumptions on Accurate Delay Measurement: In our problem definition, we assume that accurate post-silicon delay information on a small set of paths/segments is available. To accurately measure path delays, we can insert special-type scan flip-flops (e.g., as proposed in [10]). However, to the best of our knowledge, no existing literature has proposed methods for measuring a segment delay. This may be because the benefits of such a measurement have been unknown.
However, segment measurement might be possible by manual effort (e.g., probing) or by inserting scan flip-flops, such as those in [10], around the desired segment and handling the loading effects. Regardless, we believe that custom test structures can be designed so that their delays are highly correlated with the delays of designated segments, in a manner quite similar to [7]. The results of this paper thus encourage the research community to look into custom test structure designs for segment measurement, so that the overhead associated with the post-silicon validation effort can be reduced.
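The Figure 1 example from Section 2 can be checked numerically. The sketch below is only an illustration under simplifying assumptions: path delay is taken as the sum of gate delays (interconnect is ignored), and the gate delay values are hypothetical numbers chosen for the example, not data from the paper. It builds the path/gate incidence rows, confirms that their rank is 3, and verifies $d_{p_1} = d_{p_2} - d_{p_3} + d_{p_4}$.

```python
from fractions import Fraction

# Gate sequences of the four paths in Figure 1.
PATHS = {
    "p1": ["G1", "G3", "G5", "G7", "G9"],
    "p2": ["G1", "G3", "G5", "G6", "G8"],
    "p3": ["G2", "G4", "G5", "G6", "G8"],
    "p4": ["G2", "G4", "G5", "G7", "G9"],
}
GATES = [f"G{i}" for i in range(1, 10)]


def incidence_row(path):
    """0/1 row marking which gates lie on the path."""
    return [1 if g in path else 0 for g in GATES]


def rank(rows):
    """Matrix rank by exact Gauss-Jordan elimination over the rationals."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                f = m[i][col] / m[r][col]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r


G = [incidence_row(PATHS[p]) for p in ("p1", "p2", "p3", "p4")]

# Hypothetical gate delays (arbitrary units), for illustration only.
gate_delay = dict(zip(GATES, [11, 7, 9, 8, 12, 10, 6, 13, 5]))
d = {name: sum(gate_delay[g] for g in gates) for name, gates in PATHS.items()}

# Predict p1 from the three "measured" paths p2, p3, p4.
predicted_p1 = d["p2"] - d["p3"] + d["p4"]
```

Because the only linear dependency among the four rows involves all of them, any three of the four paths could serve as the exact representative set in this toy case.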

4. REPRESENTATIVE PATH SELECTION

In this section, we consider representative path selection for a given error tolerance $\epsilon$.

4.1 Exact Selection

We first consider exact selection with $\epsilon = 0$. Let $A_r$ be the rows of $A$ corresponding to the $r$ paths we select, and $A_m$ be the remaining rows. We can then rewrite Eqn (1) as

$$d_{P_{tar}} = \begin{bmatrix} d_{P_r} \\ d_{P_m} \end{bmatrix} = \begin{bmatrix} \mu_r \\ \mu_m \end{bmatrix} + \begin{bmatrix} A_r \\ A_m \end{bmatrix} x. \quad (3)$$

We first introduce the following Theorem and Lemma:

Theorem 1. The smallest $r$ for which we can exactly express $d_{P_m}$ as a linear combination of $d_{P_r}$ is $r = rank(A)$.

Proof. From Eqn (2), we have $d_{P_{tar}} = G d_S$. Therefore, $d_{P_m}$ can be exactly written as a linear combination of the delays of $r_1 = rank(G)$ representative paths, given the definition of matrix rank [5]. Similarly, from Eqn (1), we have $d_{P_{tar}} - \mu_{P_{tar}} = A x$, so $d_{P_m}$ can also be exactly written as a linear combination of the delays of $r_2 = rank(A)$ representative paths. Therefore, the smallest $r$ needed to write $d_{P_m}$ in terms of $d_{P_r}$ is:

$$r = \min(r_1, r_2) = \min(rank(A), rank(G)). \quad (4)$$

Since $A = G\Sigma$, $rank(A) \le \min(rank(G), rank(\Sigma))$ always holds [5]. This completes the proof.

Theorem 1 shows that for $r = rank(A)$, the paths corresponding to any $r$ linearly independent rows of $A$ suffice as the representative paths. This follows from the definition of matrix rank, since these $r$ rows span all the remaining rows of $A$, and consequently the entire vector of path delays $d_{P_{tar}}$ can be exactly recovered. To select these $r = rank(A)$ representative paths, we can utilize Algo. 2, which we illustrate in Section 4.3.

Lemma 1. For the smallest number of representative paths, we have $r = rank(A) \le n_S$, where $n_S = |S|$.

Proof. Since $rank(A_{n \times |x|}) \le rank(G_{n \times n_S})$ and $G$ has only $n_S$ columns, we have $r = rank(A) \le n_S$.

The above lemma states that to write an exact linear expression that maps $d_{P_r}$ to $d_{P_m}$ we need at most $n_S$ representative paths. Note that $n_S$ (the number of segments) is at most equal to the number of edges in the timing graph, since the segments are a lumped representation of the edges.
The number of edges in turn can be much smaller than the number of target paths.

To illustrate the above Theorem and Lemma, we consider circuit S1423. In this circuit, we extract 644 statistically critical paths, i.e., $n = |P_{tar}| = 644$. These paths cover 415 gates and 255 segments. Since $rank(A) = 122$, we only need 122 paths to exactly recover the delays of all remaining paths.

4.2 Approximate Selection with Effective Rank

As shown in the previous section, for exact path delay prediction, the minimum number of required paths is $rank(A)$, where $A$ is the transformation matrix between $d_{P_{tar}}$ and $x$. Here, we show that by allowing a small error tolerance $\epsilon$, we can greatly reduce the number of required paths by using the idea of the effective rank of $A$ proposed in [4].

Figure 2: The normalized singular values of transformation matrix A under two configurations. (Panels (a) and (b): normalized singular values of A on the y-axis versus index on the x-axis.)

We first explain the idea intuitively. Consider the example of S1423 again. Since the extracted 644 paths cover only 255 segments, many paths are forced to share segments. It indicates that many of the rows in $G$ are similar. That is, for many pairs of paths such as $p_i$ and $p_k$, only a few entries such as $g_{ij}$ and $g_{kj}$ in $G$ differ, indicating that the two paths differ only in a few segments (e.g., segment $s_j$). Correspondingly, this results in similarity between the i-th and k-th rows of matrix $A$, since $\Sigma$ is a constant sensitivity matrix. Therefore, intuitively, we may need far fewer than $rank(A) = 122$ paths to predict the remaining paths with high accuracy.

Formally, we can perform singular value decomposition (SVD) on $A \in \mathbb{R}^{n \times |x|}$ and obtain $A = U \Lambda V^T$, where $U$ and $V$ are $n \times n$ and $|x| \times |x|$ orthogonal matrices, respectively, and $\Lambda$ is an $n \times |x|$ diagonal matrix. The diagonal elements $\lambda_i$ of $\Lambda$ are the singular values and satisfy $\lambda_i \ge \lambda_{i+1}$. The rank of $A$ is equal to the largest $i$ such that $\lambda_i > 0$.
These $\lambda_i$ also provide other insights into the structure of $A$. Let us denote the energy as $E = \sum_{i=1}^{n} \lambda_i$. We define the effective rank of $A$ to be $\arg\min_k \left( \sum_{i=1}^{k} \lambda_i \ge (1 - \eta) E \right)$, where $\eta$ is a specified threshold, for example 5%. That is, the effective rank is the smallest index $k$ at which the cumulative sum of the singular values reaches a fraction $(1 - \eta)$ of the total energy. This effective rank is shown to be closely related to the prediction error $\epsilon$ in [4]. For matrix $A$, the effective rank can be much smaller than the rank. If the singular values $\lambda_i$ drop at a faster rate, only a few singular values are dominant and the effective rank of $A$ is small. It further indicates that fewer representative paths are required for prediction under a given error tolerance.

Figure 2 (a) plots the normalized singular values of $A$, equal to $\lambda_i / \sum_i \lambda_i$, on the y-axis using a log-linear scale for S1423. The simulation configuration is given in Section 6. In this figure, we sort the singular values of $A$ in non-increasing order and plot only the first 30 singular values. From the large gap that separates the singular values into subsets of large and small values, we can conclude that we may need only 30 paths to predict the remaining paths with very high accuracy.

However, with the further scaling of submicron technology, both the dimension and the extent of random variations with respect to the total variation (including die-to-die, within-die, and random variations) would greatly increase. In this case, the number of representative paths would grow dramatically. As an example, we increase only the sensitivity of the independent random variations in $A$ by 3X and plot its normalized singular values in Figure 2 (b). As we can see in this figure, the drop rate of the singular values of matrix $A$ decreases considerably compared to Figure 2 (a), which indicates that more representative paths are required for predicting $d_{P_{tar}}$. Similar plots are obtained if we increase the number of random variations.
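The effective-rank definition above can be sketched in a few lines. This is a minimal illustration assuming NumPy is available; the function name `effective_rank` and the toy diagonal matrix are hypothetical and are not taken from the paper or from [4].

```python
import numpy as np


def effective_rank(A, eta=0.05):
    """Smallest k whose top-k singular values reach (1 - eta) of the energy E."""
    # Singular values are returned in non-increasing order.
    s = np.linalg.svd(np.asarray(A, dtype=float), compute_uv=False)
    energy = s.sum()
    if energy == 0.0:
        return 0
    cumulative = np.cumsum(s)
    # First position where the running sum reaches the threshold (1-based count).
    k = int(np.searchsorted(cumulative, (1.0 - eta) * energy, side="left")) + 1
    return min(k, len(s))  # guard against floating-point rounding at eta = 0


# Toy matrix with singular values 10, 5, 1, 0.1 (full rank 4).
A_toy = np.diag([10.0, 5.0, 1.0, 0.1])
k_5 = effective_rank(A_toy, eta=0.05)
k_10 = effective_rank(A_toy, eta=0.10)
```

In this toy case the rank is 4, while a 5% threshold keeps three singular values and a 10% threshold keeps two, which mirrors how a looser tolerance shrinks the number of representative paths.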

Next, we discuss our proposed approximate path selection procedure assuming $\epsilon$ is provided. Specifically, we use the effective rank to select representative paths so that the prediction errors for the remaining paths are bounded by $\epsilon$.

4.3 Path Selection Procedure with $\epsilon$

The high-level algorithm for representative path selection with error tolerance $\epsilon$ is given below. We start by exactly selecting $r = rank(A)$ paths as explained in Section 4.1. The initial error in this case is $\epsilon_r = 0$. We decrement the number of representative paths by one and select $r - 1$ representative paths in Step 2.2. This introduces a new error, which we compute in Step 2.3 to update $\epsilon_r$. If the new error is still smaller than the given $\epsilon$, we repeat another iteration and further decrement $r$ until the error tolerance is reached.

Algorithm 1: Representative Path Selection
Input: error tolerance $\epsilon$, $d_{P_{tar}} - \mu_{P_{tar}} = A x$.
1. Select $r = rank(A)$ representative paths exactly and set error $\epsilon_r = 0$
2. While ($\epsilon_r \le \epsilon$)
   2.1 $r \leftarrow r - 1$
   2.2 Select $r$ representative paths from $P_{tar}$
   2.3 Update error $\epsilon_r$ for the newly selected paths.

Next, we explain Step 2.2 of Algo. 1 to select $r$ representative paths. We also discuss building a model between $d_{P_r}$ and $d_{P_m}$ and the computation of the error $\epsilon_r$ (Step 2.3).

Step 2.2: Selection of r Representative Paths

The selection of representative paths is a combinatorial optimization problem which is NP-complete [5]. From an algorithmic perspective, it is equivalent to the subset selection problem in computational linear algebra. One procedure to solve this problem approximately is QR decomposition with column pivoting, which we discuss below:

Algorithm 2: Selection of r Representative Paths
Input: Matrix $A$ and $r \le rank(A)$
1. Perform SVD on $A = U \Lambda V^T$.
2. Perform QR with column pivoting on the matrix $U_r$ composed of the first $r$ columns of $U$ and get $U_r^T P_r = QR$, where $P_r$ is an $n \times n$ permutation matrix.
3. Take $A_r$ to be the sub-matrix formed by the first $r$ rows of $P_r^T A$.

We first perform SVD on $A$ to obtain matrix $U$. Then we apply QR decomposition with column pivoting on $U$ [5]. The input to the procedure is $U_r$, a sub-matrix formed by the first $r$ columns of $U$. The matrices $Q$ and $R$ are found during the procedure and help identify the output permutation matrix $P_r$. After obtaining $P_r$, to identify the $r$ representative paths, we compute $P_r^T A$ and take the sub-matrix formed by its first $r$ rows, which in turn corresponds to $r$ path delays from the vector $d_{P_{tar}}$. Note that in Algo. 1, as we decrement $r$ at each iteration of the while loop, we apply the column-pivoting QR decomposition again. This procedure can also be implemented incrementally based on the result of the previous iteration. For more details, we refer the reader to [4].

Step 2.3: $d_{P_r} \to d_{P_m}$ Model and Error Computation

After selecting the $r$ representative paths, we use the following Theorem to build a model from the delays of the representative paths $d_{P_r}$ to the delays of the remaining paths $d_{P_m}$. We assume all entries in $x$ are independent and have a standard Gaussian distribution, as in [7].

Theorem 2. The optimal linear predictor of $d_{P_m}$ from $d_{P_r}$ is

$$\hat{d}_{P_m} = \mu_m + A_m A_r^T (A_r A_r^T)^{-1} (d_{P_r} - \mu_r), \quad (5)$$

where $()^{-1}$ denotes the pseudo-inverse operator, and $A_r$, $A_m$, $\mu_r$, and $\mu_m$ are defined in Eqn (3). Then, the prediction error for $d_{P_m}$ is

$$\Delta_r = A_m A_r^T (A_r A_r^T)^{-1} A_r x - A_m x = \Omega_r x, \quad (6)$$

where $\Omega_r = A_m A_r^T (A_r A_r^T)^{-1} A_r - A_m$ is constant after the selection is performed. It also shows that $\Delta_r$ follows a multivariate Gaussian distribution.

Error Definition: Eqn (6) shows that we can compute the deviation of the predicted path delays from the exact path delays once we determine the representative paths $P_r$. In this paper, we define the error $\epsilon_r$ used in Algo. 1 as

$$\epsilon_r = \max_{i=1,2,\ldots,n-r} WC(\Delta_r^{(i)}) / T_{cons}, \quad (7)$$

where $T_{cons}$ denotes the circuit timing constraint, and $\Delta_r^{(i)}$ denotes the i-th entry of $\Delta_r$.
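Algo. 2 and the predictor of Theorem 2 can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation (the paper reports using Matlab's svd() and qr()): the pivoting step is written out as the greedy largest-residual-norm rule that column-pivoted QR applies, the worst-case operator WC is assumed to be a 3-sigma bound, and all function names and the random toy matrix are hypothetical.

```python
import numpy as np


def select_representative_paths(A, r):
    """Algo. 2 sketch: choose r rows of A by pivoting on the columns of U_r^T.

    Assumes r <= rank(A)."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    R = U[:, :r].T.copy()              # r x n; column j belongs to path j
    chosen = []
    for _ in range(r):
        norms = np.linalg.norm(R, axis=0)
        j = int(np.argmax(norms))      # pivot: largest residual column norm
        chosen.append(j)
        q = R[:, j] / norms[j]
        R -= np.outer(q, q @ R)        # remove the chosen direction everywhere
    return np.array(sorted(chosen))


def build_predictor(A, mu, rep):
    """Theorem 2 sketch: returns remaining indices, predictor, and Omega_r."""
    rem = np.setdiff1d(np.arange(A.shape[0]), rep)
    A_r, A_m = A[rep], A[rem]
    W = A_m @ A_r.T @ np.linalg.pinv(A_r @ A_r.T)   # (n - r) x r
    omega = W @ A_r - A_m                            # prediction error = omega @ x

    def predict(d_rep):
        return mu[rem] + W @ (d_rep - mu[rep])

    return rem, predict, omega


def selection_error(omega, t_cons, k_sigma=3.0):
    """Eqn (7) with WC assumed to be k_sigma standard deviations."""
    sigma = np.linalg.norm(omega, axis=1)  # std of each entry of omega @ x, x ~ N(0, I)
    return float(np.max(k_sigma * sigma) / t_cons)


# Toy data: 20 paths, 50 variation sources, rank-4 sensitivity matrix.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 4)) @ rng.normal(size=(4, 50))
mu = rng.uniform(90.0, 100.0, size=20)
rep = select_representative_paths(A, 4)
rem, predict, omega = build_predictor(A, mu, rep)
d = mu + A @ rng.normal(size=50)           # one simulated chip
d_pred = predict(d[rep])
```

With $r = rank(A)$ the prediction is exact up to rounding, matching Theorem 1; calling the same functions with a smaller $r$ gives the non-zero $\epsilon_r$ that Algo. 1 compares against $\epsilon$. If SciPy is available, `scipy.linalg.qr(..., pivoting=True)` could replace the hand-written pivoting loop.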
The function $WC(y)$ denotes the worst-case value of random variable $y$. Therefore, Eqn (7) indicates that the worst-case prediction error (deviation from the actual path delay) cannot be larger than $\epsilon_r T_{cons}$ for all paths in $P_m$ when $|P_r|$ is equal to $r$. Since the prediction error vector $\Delta_r$ of Eqn (6) follows a multivariate Gaussian distribution with known mean and variance, we can compute $\max(\Delta_r^{(i)})$ analytically.

More importantly, the definition in Eqn (7) indicates that the maximum deviation in predicted path delays can be set to $\epsilon T_{cons}$, which can be used as a guard-band in post-silicon analysis to determine with full confidence whether a path will fail the timing constraint. This upper bound may still be too pessimistic. In fact, for each path $i$ in $P_m$, a separate error of $\epsilon_i T_{cons}$ can be obtained from Algo. 1 and used as a guard-band for more accurate analysis. We further discuss guard-band analysis and demonstrate it in our simulation results.

4.4 Complexity Analysis

In Algo. 1, we call Algo. 2 at most $r = rank(A)$ times. Each call requires one SVD and one QR decomposition as its dominant computing requirements. Sophisticated algorithms exist to solve these procedures. We use the svd() and qr() functions from Matlab in our simulations. Generally, if the number of target paths is very large, we can apply a clustering procedure to form smaller clusters of paths for speedup. Furthermore, building the model between $d_{P_r}$ and $d_{P_m}$ and evaluating the error (using Theorem 2) are done analytically and very efficiently.

5. HYBRID PATH/SEGMENT SELECTION

As motivated by Figure 2, when the dimension and range of random variations increase, we need to select more paths. In this section, we allow modeling path delays using the delays of a set of representative segments, and expect to further reduce the post-silicon validation effort. As we have already mentioned in Section 3, there are currently no techniques for measuring segment delay. Segment selection can guide the design of custom test structures which can be measured at the post-silicon stage and further reduce the post-silicon effort. Our goal is to show the benefits of knowing segment delays at the post-silicon stage, as shown by our simulations assuming that accurate post-silicon delays of segments are available.

We outline our proposed hybrid path/segment selection algorithm in Algo. 3, which takes the same $\epsilon$ as Algo. 1. Specifically, we can use Algo. 1 to solve Step 1 and Step 4. Step 3 is rather straightforward and can be done using the standard least squares method [8]; it is skipped due to lack of space. Here, $\epsilon'$ in Step 2 is smaller than $\epsilon$, since there exists additional error in going from the delays of $P_{r_1}$ to those of $P_{tar}$. We discuss the selection of $\epsilon'$ in the simulation results. In the remainder of this section, we discuss formulating Step 2 as a convex optimization problem.

Algorithm 3: Hybrid Path/Segment Selection
Input: error tolerance $\epsilon$, $d_{P_{tar}} = G d_S = \mu_{P_{tar}} + A x$
1. Select a set of representative paths $P_{r_1}$ to model $d_{P_{tar}}$ with zero error tolerance
2. Select a set of representative segments $S_{r_1}$ to model $d_{P_{r_1}}$ from Step 1 with error tolerance $\epsilon' < \epsilon$
3. Use the delays of $S_{r_1}$ from Step 2 to model $d_{P_{tar}}$; detect the set of paths $P_{r_2}$ with prediction error larger than $\epsilon$
4. Select paths/segments from $P_{r_2}$ and $S_{r_1}$ to form $P_r$ and $S_r$ with zero error tolerance.

Step 2: Representative Segment Selection

In this step, we expect to select $S_{r_1}$ from $S$ and build a model to predict $d_{P_{r_1}} \approx B d_S$, where $d_{P_{r_1}}$ and $d_S$ denote the delays of the paths $P_{r_1}$ and the segments $S$, respectively. From $d_{P_{tar}} = G d_S$ in Eqn (2), we can obtain $d_{P_{r_1}} = G_{r_1} d_S$, where $G_{r_1}$ can be derived from $G$. Thus, we can express the prediction error in this approximation as

$$\Delta_{r_1} = d_{P_{r_1}} - B d_S = (G_{r_1} - B) d_S, \quad (8)$$

which indicates that $\Delta_{r_1}$ follows a multivariate Gaussian distribution. Once $B$ is determined, we can analytically obtain the mean and covariance of $\Delta_{r_1}$. In order to obtain the optimal $B$, following similar procedures to [9], we can write a mathematical formulation as

$$\min_B \; \|B^T\|_{l_0/l_q} \quad \text{s.t.} \quad WC(\Delta_{r_1}^{(i)}) \le \epsilon' T_{cons}, \quad \text{for } i = 1, 2, \ldots, r_1 \quad (9)$$

where $\Delta_{r_1}^{(i)}$ denotes the i-th entry of $\Delta_{r_1}$, and $WC(y)$ denotes the worst-case value of random variable $y$, the same as in Eqn (7). We also denote $l_i$ as the i-th norm.
Based on the definition of the $l_0$ norm, $\|B^T\|_{l_0/l_q}$ counts the number of columns in $B$ that have a non-zero $l_q$ norm. Therefore, the $l_0/l_\infty$ norm represents the number of non-zero columns in $B$, which is equal to the number of selected segments $|S_{r_1}|$. Since the $l_0$ component in Eqn (9) yields a non-convex and computationally intractable problem, [9] proposed a relaxation to turn Eqn (9) into an easier problem based on the $l_1/l_\infty$ norm:

$$\min_B \; \sum_{i=1}^{n_S} \max(|b_{1i}|, |b_{2i}|, \ldots, |b_{ni}|) \quad \text{s.t.} \quad WC(\Delta_{r_1}^{(i)}) \le \epsilon' T_{cons}, \quad \text{for } i = 1, 2, \ldots, r_1 \quad (10)$$

where $b_{ij}$ is the ij-th element of $B$. Finally, we can observe that the formulation in Eqn (10) is convex because: 1) the objective function is convex, via defining auxiliary variables $z_i = \max(|b_{1i}|, |b_{2i}|, \ldots, |b_{ni}|)$ and translating each max into $n$ linear constraints; 2) the constraint is quadratic with respect to $B$ after squaring both sides. Since the procedure is straightforward, we refer to [9] for further details on solving the above convex optimization efficiently.

Table 1: Results for Approximate Path Selection. (Columns: BENCH, |G|, |R|, |P_tar|, exact |P_r|, approximate |P_r|, e_1 (%), e_2 (%); one row per benchmark plus an average row.)

6. SIMULATION RESULTS

We synthesized ISCAS'89 benchmarks using a 90nm TSMC library and Synopsys Design Compiler for minimum area under a stringent timing constraint to ensure that the circuits are optimized. We assume parameter variations in effective channel length $L_{eff}$ and zero-bias threshold voltage $V_t$, which are Gaussian distributed with standard deviation equal to 10% of their mean. To capture the spatial correlation between these parameter variations, we use the hierarchical model in [2], which defines rectangular regions on the chip. Column 3 in Table 1 gives the total number of regions (|R|) of each benchmark. As shown in this table, for smaller benchmarks we use a 3-level model resulting in 21 regions. For larger ones, we use a 5-level model resulting in 341 regions. This assumption is consistent with [7].
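The region counts quoted above are consistent with a quad-tree style hierarchy in which level 0 is the whole die and each further level splits every region into four. That partitioning rule is an assumption made here for illustration (the text only states the totals), and the function name is hypothetical:

```python
def hierarchical_region_count(levels, branching=4):
    """Total regions when level l (starting at 0) contains branching**l regions."""
    return sum(branching ** level for level in range(levels))


regions_3_level = hierarchical_region_count(3)   # 1 + 4 + 16
regions_5_level = hierarchical_region_count(5)   # 1 + 4 + 16 + 64 + 256
```

Under this assumption the 3-level and 5-level models give 21 and 341 regions, matching the values used in the simulations.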
In addition, each gate has a random variation term, which is 6% of the total variations. Note that this cannot be captured by [7]. In our simulation, we adopt the algorithm in [11] to extract $P_{tar}$. Specifically, we extract all paths with path yield smaller than a given threshold, since they are more likely to fail timing at the post-silicon stage. Note that our approaches can incorporate other path extraction algorithms. Finally, to evaluate our approaches, we generate $N = 10{,}000$ samples of the process variations, and compute the delays of the representative components (segments or paths, depending on the approach). Then we predict the delays of the remaining paths using our representative components and compare them with their actual delay values provided by the samples. We evaluate the following metrics:

Metric: $\varepsilon_i$ and $\hat{\varepsilon}_i$ indicate the maximum and average relative prediction errors for the i-th remaining path, respectively; $e_1$ and $e_2$ indicate the averages of $\varepsilon_i$ and $\hat{\varepsilon}_i$ over all remaining paths ($P_{tar} \setminus P_r$), respectively:

$$\varepsilon_i = \max_{k=1,\ldots,N} \frac{|d_{pred}^{(k)}(i) - d_{true}^{(k)}(i)|}{d_{true}^{(k)}(i)}, \qquad e_1 = \frac{\sum_{i=1}^{n-r} \varepsilon_i}{n-r},$$

$$\hat{\varepsilon}_i = \frac{1}{N} \sum_{k=1}^{N} \frac{|d_{pred}^{(k)}(i) - d_{true}^{(k)}(i)|}{d_{true}^{(k)}(i)}, \qquad e_2 = \frac{\sum_{i=1}^{n-r} \hat{\varepsilon}_i}{n-r},$$

where $d_{pred}^{(k)}(i)$ and $d_{true}^{(k)}(i)$ are the predicted and actual delays of the i-th remaining path for sample $k$.

6.1 Approximate Path Selection

We first evaluate the approximate and exact path selection approaches. We set the timing constraint to be the nominal circuit delay (without variation) and select all paths with a timing yield-loss greater than $0.01(1 - Y)$, where $Y$ is the circuit timing yield. We set $\epsilon = 5\%$ in Algorithm 1. In Table 1, Columns 2 and 3 give the total number of gates and regions in each circuit (|G| and |R|). The number of extracted paths ($|P_{tar}|$) is given in Column 4. For some benchmarks, we cannot extract many paths, since these circuits are intrinsically unbalanced [7]. Column 5 shows the number of representative paths of the exact approach ($|P_r|$).

This number is much smaller than the number of target paths $|P_{tar}|$ in Column 4. The prediction error in this case is zero. Column 6 shows the number of representative paths using the approximate approach when the error tolerance $\epsilon$ in Algo. 1 is set to 5%. We can observe a significant reduction in the number of representative paths. Also, the number of representative paths scales with the size of $P_{tar}$ and depends on the circuit topology. Generally, when the paths in $P_{tar}$ are more correlated, we can achieve a higher reduction with Algo. 1. Specifically, for circuit S38417, we extracted 692 paths, and 190 paths are required for exact selection with zero prediction error. With error tolerance $\epsilon = 5\%$, we can reduce $|P_r|$ to 44. We also report $e_1$ and $e_2$ in Columns 7-8, which are rather small. The implications of these errors are discussed in Section 6.3 for guard-band analysis.

6.2 Hybrid Path/Segment Selection

We evaluate our hybrid path/segment selection approach. To make the hybrid approach meaningful, we relax the timing constraint to extract more critical paths (note that $|P_{tar}|$ in Table 1 is too small). We still use the same path yield threshold to extract $P_{tar}$. We first implement the approximate path selection approach with $\epsilon = 8\%$ (Algo. 1). We then implement our hybrid approach in Algo. 3 with the same $\epsilon = 8\%$. As for the $\epsilon'$ used in Algo. 3, since hybrid path/segment selection is performed at the design stage and can be parallelized, in our simulation we try different values $\epsilon' < \epsilon$ and use the one with minimum $|P_r| + |S_r|$. Note that in this paper our goal is to show the benefit of knowing the actual segment delays, and we leave the selection of a theoretically optimal $\epsilon'$ for future work. Table 2 reports the results of this experiment.

Table 2: Results for Evaluating Hybrid Path/Segment Selection. (Columns: BENCH, |G|, |R|, |G_C|, |R_C|, |P_tar|; approximate path selection: |P_r|, e_1 (%), e_2 (%); hybrid path/segment selection: |P_r|, |S_r|, |P_r| + |S_r|, e_1 (%), e_2 (%); one row per benchmark plus an average row.)
Columns 2-6 report the total number of gates and regions of the circuit (|G| and |R|), the number of gates and regions covered by $P_{tar}$ ($|G_C|$ and $|R_C|$), and the number of extracted paths $|P_{tar}|$, respectively. Take S38417 as an example. We extract 3,507 target paths which cover 1,386 gates and 157 regions. Thus, for S38417, we have 1,700 (= 1,386 + 157 × 2) independent variations, where 1,386 denotes the random variation for each gate, 157 denotes the global/local variations, and 2 denotes the parameters $L_{eff}$ and $V_t$. For approximate path selection, we report the number of selected paths $|P_r|$ and the errors $e_1$, $e_2$ in Columns 7-9, respectively. For the hybrid approach, we report $|P_r|$ (the number of selected paths), $|S_r|$ (the number of selected segments), and their sum (the total post-silicon delay information) in Columns 10-12, respectively. The errors $e_1$ and $e_2$ are reported in Columns 13 and 14. These are the combined errors of the path and segment modeling procedures. Comparing Column 7 with Column 12, we can see that in the majority of the cases the hybrid approach can reduce the number of post-silicon delay measurements to less than 100. When the extracted paths in $P_{tar}$ are less correlated (illustrated by Column 7), our reduction is significant, while the errors do not change much. In addition, for nearly all benchmarks, the number of selected segments is less than [...].

6.3 Guard-band Analysis

Section 4.3 gives the specific definition of the error $\epsilon$ and relates it to guard-band analysis at the post-silicon stage. We let the guard-band be the margin that we need to consider after we make a path delay prediction at post-silicon, and denote it as $\phi$. As defined in the error definitions in Sections 4.3 and 5, $\phi$ is upper-bounded by $\epsilon T_{cons}$, where $\epsilon = 5\%$ in Table 1 and $\epsilon = 8\%$ in Table 2. As seen in these two tables, $e_1$ is 2.29%, 3.05% and 3.54% on average for the different approaches. This is the average $\phi$ over all predicted paths, as illustrated by the definitions of $\varepsilon_i$ and $e_1$.
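The metrics $\varepsilon_i$, $\hat{\varepsilon}_i$, $e_1$ and $e_2$ referred to here can be computed from Monte Carlo samples along the following lines. This is a plain-Python sketch; the function name and the tiny two-sample, two-path data set are hypothetical and serve only to make the definitions concrete.

```python
def relative_error_metrics(pred, true):
    """pred[k][i], true[k][i]: delay of remaining path i in sample k.

    Returns (eps_max, eps_avg, e1, e2) following the Section 6 definitions."""
    n_samples, n_paths = len(true), len(true[0])
    eps_max, eps_avg = [], []
    for i in range(n_paths):
        errs = [abs(pred[k][i] - true[k][i]) / true[k][i] for k in range(n_samples)]
        eps_max.append(max(errs))               # worst relative error of path i
        eps_avg.append(sum(errs) / n_samples)   # mean relative error of path i
    e1 = sum(eps_max) / n_paths
    e2 = sum(eps_avg) / n_paths
    return eps_max, eps_avg, e1, e2


# Hypothetical data: 2 samples, 2 remaining paths.
true_delays = [[100.0, 200.0], [100.0, 200.0]]
pred_delays = [[102.0, 200.0], [96.0, 190.0]]
eps_max, eps_avg, e1, e2 = relative_error_metrics(pred_delays, true_delays)
```

For this toy data the per-path maxima are 4% and 5%, so $e_1$ is 4.5%, while the per-path averages are 3% and 2.5%, so $e_2$ is 2.75%.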
The guard-band is applied as follows: for a path p_i, if the predicted path delay d_pred^(k)(i), divided by (1 − ε_i), is larger than T_cons, the path fails timing. The average guard-band e_1 is smaller than our pre-specified error tolerance ϵ. In addition, the average prediction error over all predicted paths in our MC simulation is very small (as illustrated by e_2). This indicates that our guard-band definition is useful for post-silicon failure detection.

7. CONCLUSIONS AND FUTURE WORK

We present a framework for post-silicon timing prediction under variability. Knowing the actual delays of a small set of carefully selected paths/segments, we can predict the timing of a large pool of target paths at the post-silicon stage. Simulations show that very few selections suffice to predict up to 3,500 paths with very high accuracy, even when the dimension of process variations is larger than 1,000. In addition, our framework can help guide the design of custom test structures. We plan to incorporate it into post-silicon diagnosis in the future.

8. REFERENCES

[1] Bastani, P., et al. Speedpath prediction based on learning from a small set of examples. In DAC (2008).
[2] Blaauw, D., et al. Statistical timing analysis: From basic principles to state of the art. TCAD 27, 4 (2008).
[3] Callegari, N., et al. Path selection for monitoring unexpected systematic timing effects. In ASPDAC (2009).
[4] Chua, D., et al. Network kriging. JSAC 24, 12 (2006).
[5] Golub, G., and Van Loan, C. Matrix Computations, 2nd ed. The Johns Hopkins University Press, London.
[6] Josephson, D. The good, the bad, and the ugly of silicon debug. In DAC (2006).
[7] Liu, Q., and Sapatnekar, S. Synthesizing a representative critical path for post-silicon delay prediction. In ISPD (2009).
[8] Nocedal, J., and Wright, S. Numerical Optimization, 2nd ed. The Springer Press, New York.
[9] Turlach, B., et al. Simultaneous variable selection. Technometrics 47 (2005).
[10] Wang, X., et al. Path-RO: a novel on-chip critical path delay measurement under process variations. In ICCAD (2008).
[11] Xie, L., and Davoodi, A. Bound-based identification of timing-violating paths under variability. In ASPDAC (2009).
