ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2 (April 2005)


Scheduling and Optimal Register Placement for Synchronous Circuits Derived Using Software Pipelining Techniques

NOUREDDINE CHABINI, Royal Military College of Canada; EL MOSTAPHA ABOULHAMID, Université de Montréal; ISMAÏL CHABINI, Massachusetts Institute of Technology; and YVON SAVARIA, École Polytechnique de Montréal

Data dependency constraints constitute a lower bound P on the minimal clock period of single-phase clocked sequential circuits. In contrast to methods based on basic retiming, clocked sequential circuits with clock period P can always be obtained using software pipelining techniques. Such circuits can be derived by any method that can be framed in the following four-step process: Step 1, determine P; Step 2, compute a valid periodic schedule of the computational elements; Step 3, place registers back in the circuit; Step 4, assign the clock signals to control registers. Methods with polynomial run-time to implement this process have been proposed in the literature. They implement these steps sequentially, starting with Step 1. These methods do not place registers optimally, which leads to an unnecessary number of registers. In this article, we address the problem of how to simultaneously implement Steps 2 and 3 in order to minimize the total number of registers. We conjecture that the problem is NP-hard in its general form. We formulate the problem for the first time in the literature, and devise a Mixed Integer Linear Program (MILP) to solve it. From this MILP, we derive a linear program to determine approximate solutions to the problem for large general circuits. We show that the proposed approach can handle nonzero clock skew. Experimental results confirm the effectiveness of the approach and show that significant reductions of the number of registers can be obtained although register sharing is not used. When the schedule is given, the proposed approach provides solutions to the problem of how to place the minimal number of registers in Step 3.

Categories and Subject Descriptors: B. [Hardware]

General Terms: Algorithms, Performance

Additional Key Words and Phrases: Retiming, software pipelining, multiphase, clock, sequential circuit

This research benefited from financial support from Le Fonds Nature et Technologies (Québec, Canada), NSF (USA), and NSERC (Canada). Authors' addresses: N. Chabini, Department of Electrical and Computer Engineering, Royal Military College of Canada, PO Box 17000, Station Forces, Kingston, ON, Canada, K7K 7B4; email: [email protected]; E. M. Aboulhamid, DIRO, Université de Montréal, C.P. 6128, Succ. Centre-ville, Montréal, QC, Canada, H3C 3J7; email: [email protected]; I. Chabini, Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Room 1-263, Cambridge, MA, USA, 02139; email: [email protected]; Y. Savaria, Department of Electrical Engineering, École Polytechnique de Montréal, C.P. 6079, Succ. Centre-ville, Montréal, QC, Canada, H3C 3A7; email: [email protected]. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or [email protected]. © 2005 ACM 1084-4309/05/0400-0187 $5.00. ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005, Pages 187–204.

1. INTRODUCTION

Data dependency constraints constitute a lower bound on the clock period of synchronous sequential circuits. This lower bound, denoted here P, can be determined by solving an instance of the well-known Cost-to-Time Ratio Cycle Problem [Dasdan and Gupta 1998; Gerez et al. 1992; Lawler 1976] on the graph modeling the circuit. Basic retiming has been proposed as an optimization technique for synchronous circuits [Leiserson and Saxe 1991]. This technique changes the location of registers in the circuit in order to achieve one of the following goals: i) minimizing the clock period, ii) minimizing the number of registers, or iii) minimizing the number of registers for a target clock period. Basic retiming [Leiserson and Saxe 1991] may fail to transform a given synchronous single-phase sequential circuit into a functionally equivalent clocked circuit with a clock period of value P. Indeed, as presented in Boyer et al. [2001a] and Lockyear and Ebeling [1994], basic retiming can transform the correlator circuit into another one with a minimal clock period of value 13, whereas a functionally equivalent circuit with a clock period of value P = 10 can be obtained as shown in those papers. Figure 1 presents two other circuits to show 1) that basic retiming can fail to produce circuits with the clock period of value P that is due to data dependency constraints only, and 2) how much the clock period can be reduced by using methods based on software pipelining techniques, like the method in Boyer et al. [2001a], instead of methods based on basic retiming. For the circuits in Figure 1, basic retiming gives a minimal clock period of value 60 for circuit #1, and of value 45x for circuit #2. Functionally equivalent circuits with clock period P = 45 can be obtained using, for instance, the method in Boyer et al. [2001a]. This reduces the clock period by 25% for circuit #1 and by the fraction ((x − 1)/x) for circuit #2 (for instance, when x = 2, this reduction is 50%). When basic retiming fails to obtain a circuit with clock period of value P, a functionally equivalent circuit with clock period of value P can still be obtained with the penalty of increasing the number of clocks (phases); such a circuit is then called a multiphase clocked sequential circuit. For instance, the correlator produced in Boyer et al. [2001a] and Lockyear and Ebeling [1994] is a two-phase circuit. Details on multiphase clocked sequential circuits can be found, for instance, in Ishii et al. [1997] and Lockyear and Ebeling [1994].


Fig. 1. Examples to show that basic retiming can fail in minimizing the clock period.

Methods to transform single-phase clocked sequential circuits into functionally equivalent ones with a clock period as close as possible to P are proposed in Legl et al. [1997], Ishii et al. [1997], Lockyear and Ebeling [1994], and Maheshwari and Sapatnekar [1999]. In Legl et al. [1997], basic retiming has been extended to deal with circuits whose registers are not enabled at the same time. The idea is that registers controlled by the same phase can be moved across computational elements. It is known that with level-sensitive storage elements (latches), clocked circuits can be made faster and smaller [Ishii et al. 1997; Lockyear and Ebeling 1994] than with edge-triggered flip-flops. In Ishii et al. [1997], methods to minimize the clock period of multiphase level-sensitive clocked circuits are provided, as well as procedures to derive these kinds of circuits from edge-triggered ones. In Lockyear and Ebeling [1994] and Maheshwari and Sapatnekar [1999], retiming with multiphase clocks is proposed. For the methods in Lockyear and Ebeling [1994] and Maheshwari and Sapatnekar [1999], the phases are fixed before retiming, which can give a clock period of value P only if good phases are chosen. Clock skew is defined as the maximum difference of the delays from the clock source to the clock pins of storage elements [Tsay 1993]. Clock skew can cause malfunction of clocked circuits. Methods to ensure zero skew in the design are reported in Li and Jabori [1992] and Tsay [1993]. However, skews are sometimes used as a tool to improve the performance of clocked circuits [Fishburn 1990; Deokar and Sapatnekar 1995; Sapatnekar and Deokar 1996]. In Fishburn [1990], two linear programs are presented to solve the problem of finding skews that minimize the clock period and the problem of maximizing skews for a target clock period. The equivalence between clock skew and retiming was first reported in Fishburn [1990], and a formal proof is provided in Deokar and Sapatnekar [1995]. In the work of Deokar and Sapatnekar [1995], a clock skew optimization problem is first solved with the objective of minimizing the clock period; the obtained skews are then transformed into a retiming by moving some flip-flops across combinational blocks. For single-phase clocked circuits, a mixed integer linear program combining retiming and clock skew is devised in Friedman et al. [1999] and Liu et al. [2002]. As presented in


Boyer et al. [2001b], the tolerance to clock skew of clocked circuits can be improved by using latches instead of flip-flops. That paper shows that, for multiphase clocked circuits operating at the minimal clock period P, the maximum tolerance to clock skew is (P − D_max)/4, where D_max is the propagation delay of the slowest computational element in the circuit. Software pipelining is a powerful technique for increasing the instruction-level parallelism for parallel processors. This method overlaps the execution of successive iterations in order to reduce the difference between their start execution times. For an introduction to software pipelining and its related techniques, the reader is referred to Allan et al. [1995]. To the best of our knowledge, no method based on basic retiming can always transform single-phase clocked sequential circuits into functionally equivalent clocked sequential circuits with a minimal clock period P that is due to data dependency constraints only. Methods based on software pipelining techniques to obtain the latter circuits have recently been proposed [Boyer et al. 2001a, b; Chabini et al. 2001]. Neither the number of phases nor the kind of memory elements to be used is fixed in advance in Boyer et al. [2001a] and Chabini et al. [2001], in contrast to some published approaches, like Lockyear and Ebeling [1994] and Maheshwari and Sapatnekar [1999], that we reviewed previously. As mentioned, the methods in Lockyear and Ebeling [1994] and Maheshwari and Sapatnekar [1999] can produce circuits with clock period equal to P only if good phases are chosen, while the methods in Boyer et al. [2001a] and Chabini et al. [2001] are always able to obtain circuits that operate at P. The latter methods can be framed in the following process. Step 1: Determine the minimal value P of the clock period due to data dependency constraints only. Step 2: Compute a valid periodic schedule of the computational elements with period P. Step 3: Place registers in the circuit according to the computed schedule. Step 4: Determine the phases to control the registers. The method in Boyer et al. [2001a] implements this process sequentially, starting from Step 1. For Step 2, only As Soon As Possible (ASAP) or As Late As Possible (ALAP) schedules are computed. As presented in Chabini et al. [2001], using ASAP or ALAP schedules can lead to circuits with an unnecessary number of registers or phases. For Step 3, this method uses a heuristic, which again can lead to an unnecessary number of registers or phases. Chabini et al. [2001] provided two methods with polynomial run-time to determine schedules that reduce register requirements and the number of required phases. Compared to Boyer et al. [2001a], these methods proved very efficient in reducing the number of registers and the number of required phases. Nevertheless, the problem of how to efficiently place registers in the circuit is not addressed in Chabini et al. [2001]. For software pipelining in the context of loops, methods for scheduling under register constraints to generate code for parallel processors have been examined in the literature. But it was assumed that processors are single-phase


clocked. Circuits derived from the previously described process can be multiphase. Hence, these methods cannot be used to simultaneously implement Steps 2 and 3 of the process. In this article, we address the problem of how to simultaneously implement Steps 2 and 3 of the process in order to minimize the number of registers. We propose the first formulation in the literature for this problem, from which we derive a mixed integer linear program (MILP). We conjecture that the problem is NP-hard in its general form. Linear Programs (LPs) are solvable in polynomial time. From this MILP, we derive an LP to determine approximate solutions to the problem for large general circuits. Furthermore, we present how the proposed approach can handle nonzero clock skew. To test the effectiveness of the approach in minimizing the number of registers, we apply the MILP and the LP to well-known benchmarks and show the superiority of the approach over the method in Boyer et al. [2001a]. The assessment of the approach is also done in the case of nonzero clock skew, and the obtained results show the superiority of the approach over the method in Boyer et al. [2001b]. We compare our experimental results to Boyer et al. [2001a, b] since, to the best of our knowledge, there are no other papers at this moment that are close to the issue we address here. The next section gives some notations and definitions used in this article. Section 3 explains what register placement means, briefly reviews the register placement step in the method of Boyer et al. [2001a], presents how the phases to control registers are computed, and shows that the algorithm proposed in Boyer et al. [2001a] to place registers is not exact. The problem we address and its formulation are presented in Section 4. A linear program to determine approximate solutions to this problem is given in Section 5. Section 6 presents how the proposed approach can handle nonzero clock skew and gives a theoretical result. Experimental results are provided in Section 7, and Section 8 concludes the article.

2. PRELIMINARIES

2.1 Design Representation

The input to our approach in this article is a single-phase synchronous sequential circuit like the one in Figure 2(a). As in Boyer et al. [2001a], Maheshwari and Sapatnekar [1997], Shenoy and Rudell [1994], and Leiserson and Saxe [1991], we model the input circuit by a directed cyclic graph G = (V, E, d, w), where V is the set of computational elements in the circuit, and E is the set of edges, which represent interconnections between vertices. Let N be the set of nonnegative integers. Each vertex v ∈ V has a propagation delay d(v) ∈ N, which is assumed to be fixed in this article. Each edge e_{u,v}, from u to v, in E is weighted with a register count w(e_{u,v}) ∈ N, representing the number of registers on the wire between u and v. As in Boyer et al. [2001a], Maheshwari and Sapatnekar [1997], Shenoy and Rudell [1994], and Leiserson and Saxe [1991], propagation delays of registers and wires are assumed to be equal to zero. We believe that this delay model is


Fig. 2. Sample circuit and its directed cyclic graph model.

acceptable at the high-level abstraction of the design, but not when computational elements are, for instance, transistors. Even though we assume this delay model, the problem we address in the article is still complex. Figures 2(a) and 2(b) present an example of a single-phase synchronous sequential circuit and its directed cyclic graph model, respectively. In Figure 2(a), large rectangles represent computational elements and small rectangles represent registers. Wires are oriented to show the propagation direction of the signals. The propagation delay of each computational element of this circuit is specified as a label on the left of each large rectangle. This example will be used throughout this article, and will serve to illustrate the initial specification of the problem to be solved. Without any optimization, the minimum clock period of the circuit in Figure 2 is 80, which is equal to d(v_5) + d(v_1) + d(v_3).

2.2 Periodic Schedules

We define a schedule s [Bennour 1996; Boyer et al. 2001a] as a function s : N × V → Q, where s_n(v) ≡ s(n, v) denotes the schedule time of the nth iteration of operation v. In multiphase flip-flop-based circuits, the schedule time of operation v is the start execution time of v. A schedule s is called periodic with period P if:

∀n ∈ N, ∀v ∈ V : s_{n+1}(v) = s_n(v) + P.    (1)

When there is no resource constraint, a schedule s is said to be valid if and only if the operations terminate before their results are needed. In this case, we say that data dependencies are satisfied, which is equivalent to the following mathematical inequality:

∀n ∈ N, ∀e_{u,v} ∈ E : s_{n+w(e_{u,v})}(v) ≥ s_n(u) + d(u).    (2)


2.3 Maximum Throughput of Synchronous Sequential Circuits

Let C be the set of directed cycles in the directed cyclic graph modeling the circuit. Based on data dependency constraints only, the maximum throughput, denoted T, is given by the following expression [Bennour 1996; Bennour and Aboulhamid 1995]:

T = min_{c ∈ C} ( Σ_{e_{u,v} ∈ c} w(e_{u,v}) / Σ_{e_{u,v} ∈ c} d(u) ).    (3)

Determining the maximum throughput is a Minimal Cost-to-Time Ratio Cycle Problem [Gerez et al. 1992; Lawler 1976]. This problem can be solved in the general case with a run-time in O(|V||E| log(|V| d_max)) [Dasdan and Gupta 1998; Lawler 1976], where d_max = max_{v∈V}(d(v)). A possible method to solve this problem is to iteratively apply Bellman-Ford's algorithm [Cormen et al. 1990] for longest paths on the graph G_P = (V, E, d, w_P) derived from G by letting:

w_P(e_{u,v}) = d(u) − P · w(e_{u,v}),    (4)

where e_{u,v} ∈ E and P = 1/T. A binary search may be used to find the minimal value of P for which there is no positive cycle in G_P [Bennour 1996; Bennour and Aboulhamid 1995]. Without loss of generality, for circuits that do not attempt to perform wave pipelining, we assume that P is greater than or equal to the propagation delay of each computational element in the circuit. By applying expression (3) to the example circuit in Figure 2, the value of P is 60. This value corresponds to the cycle defined by vertices v_1, v_2, v_4, and v_5. Notice that applying basic retiming for minimal clock period on that circuit leads to a larger value of P. Indeed, it leads to P = 70.

2.4 Periodic Schedule for a Given Period

From equation (1) and inequality (2), we have that:

∀e_{u,v} ∈ E, s_0(v) − s_0(u) ≥ d(u) − P · w(e_{u,v}).    (5)

In the case of periodic schedules, determining a valid schedule of all the instances of each vertex v in V is equivalent to determining s_0(v) for each v in V, which consists of finding a solution to the system of inequalities described by (5). To solve this system, the graph G_P described in the previous section may be used. Note that ASAP and ALAP schedules are possible solutions to this system. To find an ASAP schedule, Bellman-Ford's algorithm [Cormen et al. 1990] for longest paths, from a chosen vertex v_x to the other vertices, may be applied on the graph G_P. Finding an ALAP schedule may be done as follows. In Step 1, a graph G′ has to be derived from G_P by inverting the direction of each edge in G_P. In Step 2, Bellman-Ford's algorithm for longest paths, from the vertex v_x to the other vertices, has to be applied on the graph G′, where the weights of its edges are defined by Equation (4). Finally, in Step 3, the ALAP schedule is obtained by multiplying each result of Step 2 by −1. Relative to v_x = v_1, the ASAP schedules of vertices v_1, v_2, v_3, v_4, v_5, and v_6 of the circuit in Figure 2 are 0, −30, 30, −10, −40, and −30, respectively. Their ALAP schedules are 0, −30, 40, −10, −40, and 10, respectively.
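To make the two computations above concrete, here is a minimal Python sketch — ours, not the authors' implementation — of both steps: a binary search for the smallest P leaving no positive cycle in G_P, and ASAP/ALAP schedules via Bellman-Ford longest paths. The edge-list encoding, the tolerance eps, and all names are our assumptions.

```python
# Illustrative sketch of Sections 2.3-2.4, not the authors' code.
# A circuit is a list of edges (u, v, w) with register counts w and a
# delay map d; G_P carries arc weights d(u) - P * w(e_uv), per Eq. (4).

def longest_paths(vertices, arcs, source=None):
    """Bellman-Ford relaxation for longest paths; arcs are (u, v, weight).
    With source=None every vertex starts at 0, so the extra pass at the
    end doubles as a positive-cycle detector."""
    inf = float('-inf')
    dist = {v: (0.0 if source is None else inf) for v in vertices}
    if source is not None:
        dist[source] = 0.0
    for _ in range(len(vertices) - 1):
        for u, v, wt in arcs:
            if dist[u] != inf and dist[u] + wt > dist[v]:
                dist[v] = dist[u] + wt
    positive_cycle = any(dist[u] != inf and dist[u] + wt > dist[v]
                         for u, v, wt in arcs)
    return dist, positive_cycle

def gp_arcs(edges, d, P):
    return [(u, v, d[u] - P * w) for u, v, w in edges]      # Eq. (4)

def minimal_period(vertices, edges, d, eps=1e-6):
    """Smallest P with no positive cycle in G_P (binary search)."""
    lo = max(d.values())              # no wave pipelining: P >= slowest element
    hi = sum(d.values())              # safe upper bound: every cycle holds a register
    while hi - lo > eps:
        mid = (lo + hi) / 2
        _, positive = longest_paths(vertices, gp_arcs(edges, d, mid))
        lo, hi = (mid, hi) if positive else (lo, mid)
    return hi

def asap_alap(vertices, edges, d, P, vx):
    """ASAP/ALAP start times s0 relative to vx (assumed to reach all vertices)."""
    arcs = gp_arcs(edges, d, P)
    s_asap, _ = longest_paths(vertices, arcs, source=vx)
    reversed_arcs = [(v, u, wt) for u, v, wt in arcs]       # Step 1: invert edges
    back, _ = longest_paths(vertices, reversed_arcs, source=vx)   # Step 2
    return s_asap, {v: -back[v] for v in vertices}          # Step 3: negate
```

With an edge list and delay map transcribed from Figure 2, minimal_period should return 60 and asap_alap the ASAP/ALAP values quoted above.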


Fig. 3. Schedule graph.

2.5 Schedule Graph

A periodic schedule with period P is expressed by a schedule graph G_s = (V, E, d, T_s, P) [Boyer et al. 2001a]. Here V, E, and d have the same definition given for the graph G previously defined. T_s : E → Q is a weight function which associates to each edge e_{u,v} in E the time distance between the schedule times of u and v. Mathematically, T_s(e_{u,v}) is defined as follows:

∀e_{u,v} ∈ E, T_s(e_{u,v}) = s_{w(e_{u,v})}(v) − s_0(u).    (6)

Because s is periodic with period P, Equation (6) may be rewritten as follows:

∀e_{u,v} ∈ E, T_s(e_{u,v}) = s_0(v) − s_0(u) + P · w(e_{u,v}).    (7)
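A small helper — ours, for illustration — makes Equation (7) operational; the assert anticipates the consistency condition T_s(e_{u,v}) ≥ d(u) stated just below.

```python
# Build the schedule-graph weights Ts of Eq. (7) from a period-P schedule s0,
# checking each edge against the consistency condition Ts(e_uv) >= d(u).

def schedule_graph(edges, d, s0, P):
    Ts = {}
    for u, v, w in edges:
        Ts[(u, v)] = s0[v] - s0[u] + P * w        # Eq. (7)
        assert Ts[(u, v)] >= d[u], f"schedule inconsistent on edge ({u}, {v})"
    return Ts
```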

The graph G_s is consistent if and only if, for each edge e_{u,v} in E, T_s(e_{u,v}) ≥ d(u). This is derived from inequality (2). Figure 3 shows a consistent schedule graph, where edges are labeled with T_s values for the circuit in Figure 2, using the ASAP schedule determined in Section 2.4. The weight of each arc in the schedule graph is expressed in time units.

3. REGISTER PLACEMENT AND ASSIGNMENT OF PHASES

For circuits optimized using basic retiming [Leiserson and Saxe 1991], registers are placed in the optimized circuit using the following formula: ∀e_{u,v} ∈ E, w_r(e_{u,v}) = r(v) − r(u) + w(e_{u,v}), where w_r(e_{u,v}) and w(e_{u,v}) are, respectively, the number of registers on the arc e_{u,v} after and before retiming, and r(u) is the value assigned by basic retiming to each computational element u in the circuit. In the rest of this section, we show how registers can be placed and controlled in circuits derived by the process we presented in Section 1. To this end, we review the method in Boyer et al. [2001a], which is a possible implementation of that process. The approach we are proposing in this article leads to better implementations of the process. In the method proposed in Boyer et al. [2001a], registers are placed back in the circuit by pipelining the schedule graph G_s defined in Section 2.5. Every path in G_s that is longer than the minimal clock period P is broken by inserting


Fig. 4. Placement and phases of registers using algorithm in Boyer et al. [2001a].

registers on it. For paths having a length (in time units) less than P, no register is required if operation chaining is assumed. In synchronous single-phase sequential circuits, registers are controlled by the same signal, called the clock. When clock skew is not supported, registers in that case must receive the clock at the same moment. In synchronous multiphase sequential circuits, registers are not necessarily controlled by the same clock. In this case, the clocks can have the same period and be defined relative to a global clock, which can be one of those clocks. Each clock is then an offset of the global clock. That offset is called the phase in the literature. Circuits derived by the process we presented in Section 1 can be multiphase, and all the clocks have the same period. In the case of the method in Boyer et al. [2001a], which is a possible implementation of the process, once registers are placed, the phases to control them are computed as follows. The phase of a register on the input of a computational element v is (s_0(v) modulo P), where s_0(v) is the schedule of v, and P is the minimal clock period due to data dependency constraints only. Figures 4(a) and 4(b) present the placement of registers and their phases obtained using the algorithm provided in Boyer et al. [2001a] to place registers using the schedule graph depicted in Figure 3. The latter graph corresponds to the circuit in Figure 2 and is obtained as explained in Section 2.5. Data in Figure 4(c) are provided to assist the reader interested in computing the phases given in Figure 4(b). The number of registers that are placed in the circuit is 6, and the number of phases to control them is 4. The algorithm for register placement in Boyer et al. [2001a] is not optimal in the sense that it does not use a minimum number of registers. Indeed, for


Figure 4(a), register R1 can be omitted since there is no combinational path longer than P between R4 and R5.

4. PROBLEM FORMULATION AND APPROACHES FOR ITS RESOLUTION

Our focus is to simultaneously realize Steps 2 and 3 of the process presented in Section 1 in order to minimize the number of registers. The problem we address in this article, denoted Π, is then to determine a schedule with minimum register requirements, where register placement is done during schedule determination. We do not support register sharing as in the case when basic retiming is used since, in our case, the obtained circuits can be multiphase clocked sequential circuits, and, in this case, registers on the output of a computational element can be shared only if they are controlled by the same phase. However, once the registers are placed, one can examine the phases of registers on the output of each computational element to decide whether to share them. Let us present the problem Π in a way that makes it easier to understand our approach to solving it. As explained in Section 3, the placement of registers consists in pipelining the schedule graph to obtain a circuit that can operate with the minimal clock period P. Recall that in Boyer et al. [2001a] the placement of registers is done once the schedule is computed. If the schedule is given, the problem Π transforms into a problem of pipelining the schedule graph while using a minimal number of registers. The weight of each arc in the schedule graph is given by Equation (7) (i.e., ∀e_{u,v} ∈ E, T_s(e_{u,v}) = s_0(v) − s_0(u) + P · w(e_{u,v})). Instead of fixing the schedule first, before pipelining the schedule graph, we want to make the schedule a variable in the problem and then pipeline the resulting schedule graph. We conjecture that the problem Π is NP-hard in its general form. We provide in this section a mathematical formulation (MF) of the problem. From this MF, we derive a mixed integer linear program (MILP) that can be used for solving the problem for special or small-size circuits. In Section 5, we derive from this MILP a linear program to determine approximate solutions to the problem for general large circuits. Before presenting the details related to the MF and the MILP, let us first give some definitions and notations while introducing an informal formulation of the problem. Figure 5 gives a portion of the schedule graph to pipeline, where i and j are two computational elements. The unknown variable x_{i,j} denotes the number of registers that must be placed on the arc e_{i,j} to guarantee that the length, l_{i,j}, of every path that goes to j via i is less than or equal to the minimal clock period P. The variable l_{i,j} is defined in the following. Note that, as in Boyer et al. [2001a], operation chaining is assumed, and hence no register is required if l_{i,j} ≤ P. Suppose that paths that go to j via i have already been examined in order to determine whether some registers must be placed on them. Let m_i be a nonnegative real number greater than or equal to each remainder that is obtained by dividing the length of each one of those paths by P. The length l_{i,j} of every path that goes to j via i is the sum of m_i and T_s(e_{i,j}), where T_s(e_{i,j}) is defined by Equation (7). The variable y_{i,j} is the remainder of the division of l_{i,j} by P. We require that m_i ≤ (P − d(i)), which guarantees that, if a register R is placed on the


Fig. 5. Illustration of some unknown variables of the MF.

Fig. 6. A mathematical formulation of the problem Π.

output of computational element i, then R will be activated after i finishes its execution. Figure 6 presents a mathematical formulation of the problem Π. The objective function expresses the number of registers to be placed in the circuit. Equations (8), (9)-(10), and (11) are equivalent to the definitions of x_{i,j}, y_{i,j}, and m_i, respectively. Inequality (12) is equivalent to (5). (13) is required since the numbers of registers must be nonnegative integers, and m_i is by definition a nonnegative real. In this formulation, the unknown variables are x_{i,j}, y_{i,j}, m_i, and the schedule s_0(i) for each computational element i. The formulation in Figure 6 can be linearized as follows. Using the fact that ⌊x⌋ ≤ x < ⌊x⌋ + 1, and that no register is required if the length of a path is less than or equal to P, Equation (8) can be replaced by:

∀e_{i,j} ∈ E, x_{i,j} ≤ (s_0(j) − s_0(i) + P · w(e_{i,j}) + m_i)/P ≤ x_{i,j} + 1.    (14)

Equations (9) and (10) together can be replaced by:

∀e_{k,i} ∈ E, m_i ≥ (s_0(i) − s_0(k) + P · w(e_{k,i}) + m_k) − P · x_{k,i}.    (15)

After linearizing the formulation in Figure 6, we obtain the MILP presented in Figure 7. In this figure, Equations (16) and (17) are equivalent to (14); (18) is equivalent to (15); and (19), (20), and (21) are equivalent to (11), (12), and (13), respectively.
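Since Figure 7 itself is not reproduced in this text, the sketch below reconstructs the MILP from Equations (11)-(15); the y_{i,j} variables drop out of the linearization and so do not appear. We use the open-source PuLP modeler purely for illustration — the authors generated their formulations with a C++ module and solved them with Lp_Solve (Section 7) — and all identifiers are ours.

```python
# Reconstruction (ours) of the MILP of Figure 7 from Eqs. (11)-(15),
# modeled with PuLP. Assumes at most one edge per ordered pair (i, j).

from pulp import LpProblem, LpMinimize, LpVariable, lpSum

def build_milp(vertices, edges, d, P):
    """vertices: iterable of names; edges: (i, j, w_ij) triples; d: delays."""
    prob = LpProblem("register_placement", LpMinimize)
    x = {(i, j): LpVariable(f"x_{i}_{j}", lowBound=0, cat="Integer")
         for i, j, _ in edges}                       # registers on arc (i, j)
    m = {i: LpVariable(f"m_{i}", lowBound=0) for i in vertices}
    s = {i: LpVariable(f"s_{i}") for i in vertices}  # schedule s0(i), free

    prob += lpSum(x.values())                        # minimize total registers
    for i, j, w in edges:
        l_ij = s[j] - s[i] + P * w + m[i]            # length of paths to j via i
        prob += P * x[(i, j)] <= l_ij                # from Eq. (14)
        prob += l_ij <= P * (x[(i, j)] + 1)          # from Eq. (14)
        prob += m[j] >= l_ij - P * x[(i, j)]         # Eq. (15): remainder bound
        prob += s[j] - s[i] >= d[i] - P * w          # data dependencies, ineq. (5)
    for i in vertices:
        prob += m[i] <= P - d[i]                     # register fires after i ends
    return prob, x, s
```

Calling prob.solve() with any MILP-capable backend then gives the number of registers to place on arc (i, j) as the value of x_{i,j}.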


Fig. 7. An MILP for solving the problem Π.

Fig. 8. A linear program to determine approximate solutions to the problem Π.

5. A LINEAR PROGRAM TO DETERMINE APPROXIMATE SOLUTIONS TO THE PROBLEM Π

Linear programs are solvable in polynomial time [Karmakar 1984; Khachian 1979]. A linear program to determine approximate solutions to the problem Π can be obtained by deleting the constraint that x_{i,j} is an integer in Figure 7. In this case, once the linear program is solved, the number of registers to be placed on the arc e_{i,j} is ⌈x_{i,j}⌉. Details on why it is possible to place ⌈x_{i,j}⌉ registers on the arc e_{i,j} can be found in Boyer et al. [2001c]. By replacing (P · x_{i,j}) with X_{i,j}, the linear program can be written as in Figure 8. Minimum cost network flow problems have specialized graph algorithms [Ahuja et al. 1993; Goldberg and Tarjan 1987; Lawler 1976] for their resolution. These algorithms should always run faster than general tools for solving linear programs. By careful mathematical transformations, one can obtain from the primal in Figure 8 a dual that is a formulation of a minimum cost network flow problem.
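The corresponding relaxation is a one-line change to the sketch above: drop integrality on x and round each fractional count up when placing registers, per Boyer et al. [2001c]. Again, this is our illustration, reusing build_milp from the previous block.

```python
import math
from pulp import value

def solve_lp_relaxation(vertices, edges, d, P):
    prob, x, _ = build_milp(vertices, edges, d, P)   # from the MILP sketch above
    for var in x.values():
        var.cat = "Continuous"      # relax integrality: now a polynomial-time LP
    prob.solve()
    # place ceil(x_ij) registers on arc (i, j), per Boyer et al. [2001c]
    return {arc: math.ceil(value(var)) for arc, var in x.items()}
```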




Fig. 9. A mathematical formulation of the problem Π in the case of nonzero clock skew.

6. MULTI-PHASE CLOCKED CIRCUITS WITH NONZERO CLOCK SKEW

Clock skew generally limits the performance of synchronous systems. However, intentional clock skew can be introduced to improve the performance of clocked circuits. In Boyer et al. [2001b], a relation was established between the tolerable skew δ, the minimal clock period P at which the circuit operates, and the maximal value of the distance, Dist, between the schedules of any two adjacent registers. It was shown that: i) when minimum delays are considered to be zero, we have

δ = (P − Dist)/4,    (47)

and ii) the maximum tolerable skew is obtained when Dist = D_max, where D_max is the propagation delay of the slowest computational element in the circuit. To produce a multiphase clocked circuit operating at the minimal clock period P and having a given tolerance to clock skew δ, the approach in Boyer et al. [2001b] uses the method in Boyer et al. [2001a]. Also, during the register placement outlined in Section 3, paths are broken whenever their lengths become greater than Dist, instead of when they are greater than P. As reported in Boyer et al. [2001b], open questions remain on how to reduce the number of registers that must be placed in the circuit and how to exploit noncritical paths to reduce the number of required registers. The main limitations in Boyer et al. [2001b] come from the use of ASAP or ALAP schedules and from the lack of an exact algorithm for register placement. In other words, the approach in Boyer et al. [2001b] did not address the problem of how to simultaneously implement Steps 2 and 3 of the process presented in Section 1. The approach we have proposed in the previous sections can be extended to the case of nonzero clock skew. Indeed, from Equation (47), we have that:

Dist = P − 4 · δ    (48)

for a given tolerance to clock skew δ. By using Equation (48), the mathematical formulation in Figure 6 transforms into the one presented in Figure 9. A mixed integer linear program (MILP_skew) and a linear program (LP_skew) can be derived from the formulation in Figure 9 as we did to obtain Figures 7 and 8. From inequality (49), we can deduce the following theorem.

THEOREM 1. For a clocked sequential circuit operating at the minimal clock period P, a tolerance to clock skew δ = P/4 requires an infinite number of registers.
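A few lines of arithmetic (with illustrative numbers, not taken from the paper's benchmarks) make Equations (47)-(48) and the theorem tangible:

```python
# Eq. (47): tolerable skew delta = (P - Dist) / 4, assuming zero minimum delays.
# Eq. (48): a target skew delta fixes the path-length budget Dist = P - 4*delta.
# As delta approaches P/4 the budget shrinks to 0, so the number of registers
# needed to break every path diverges -- Theorem 1.

def tolerable_skew(P, dist):
    return (P - dist) / 4.0          # Eq. (47)

def path_budget(P, delta):
    return P - 4.0 * delta           # Eq. (48)

assert tolerable_skew(60.0, 30.0) == 7.5      # e.g. P = 60, Dist = Dmax = 30
assert path_budget(60.0, 7.5) == 30.0         # Eqs. (47)/(48) round-trip
assert path_budget(60.0, 60.0 / 4) == 0.0     # delta = P/4: no budget left
```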


Table I. Register Placement by Boyer et al. [2001a] vs. by the Proposed MILP

Circuit                           | # Registers Placed by Boyer et al. [2001a] | # Registers Placed Using Figure 7 | Relative Gain
Figure 2                          | 6*  | 5 | 16.66%
SOIR Filter [Bennour 1996]        | 3   | 2 | 33.33%
Polynomial Divider [Bennour 1996] | 9   | 4 | 55.55%
Correlator [Boyer et al. 2001a]   | 6   | 6 | 0%
FOWDEF [Kung 1985]                | 14  | 8 | 42.85%

* ASAP schedule is used.

PROOF. From inequality (5), we have that:

∀e_{i,j} ∈ E, s_0(j) − s_0(i) + P · w(e_{i,j}) ≥ d(i) > 0.    (55)

By definition, we have that:

∀i ∈ V, m_i ≥ 0.    (56)

By (55) and (56), we can deduce that:

∀e_{i,j} ∈ E, s_0(j) − s_0(i) + P · w(e_{i,j}) + m_i > 0.    (57)

By (57), the numerator of (49) is always positive. Hence, when δ tends to P/4, (δ − P/4) tends to 0. Consequently, x_{i,j} tends to infinity when δ tends to P/4. In this case, the number of required registers is infinite.

7. EXPERIMENTAL RESULTS

Our main objective in this section is to test the effectiveness of our approach in reducing the area of designs derived as presented in the previous sections. To this end, we assess the MILP in Figure 7 and the corresponding linear program (LP) in Figure 8 (a heuristic that provides approximate solutions to the problem). Since we conjecture that the problem we address is NP-hard in its general form, it is natural to also evaluate the run-time of any heuristic for the problem on large circuits. In this experimentation, circuits from the ISCAS89 benchmark suite [ISCAS'89 Benchmark suite 1996] are used to test the efficiency of the LP in terms of reduction of the number of registers inserted in the circuit, and of run-time. The mathematical formulations for each circuit are automatically generated by a module we coded in C++. We did not implement the cited polynomial-time algorithms for linear programs; instead, the Lp_Solve tool (in the public domain) is used to solve the generated mathematical formulations. Obtained results are given in Tables I and II, where the first column gives the name of the circuit and the second column presents the number, N1, of registers placed using the algorithm in Boyer et al. [2001a] with ALAP as a schedule. The number, N2, of registers placed by the proposed MILP, or by the proposed LP, is presented in the third column. The fourth column gives the relative gain defined as ((N1 − N2)/N1) × 100%. For Table II, the fifth column gives the run-time in seconds on an UltraSparc 10 with 1GB RAM. As Table I reports, significant reductions of the number of required registers are obtained, although register sharing is not used in this article. Substantial


Table II. Register Placement by Boyer et al. [2001a] vs. by the Proposed LP

Circuit                           | # Registers Placed by Boyer et al. [2001a] | # Registers Placed Using Figure 8 | Relative Gain | Run-Time (s)
Figure 2                          | 6*   | 5   | 16.66% | 0.01
SOIR Filter [Bennour 1996]        | 3    | 3   | 0%     | 0.02
Polynomial Divider [Bennour 1996] | 9    | 4   | 55.55% | 0.02
Correlator [Boyer et al. 2001a]   | 6    | 6   | 0%     | 0.01
FOWDEF [Kung 1985]                | 14   | 12  | 14.28% | 0.06
s344                              | 131  | 53  | 59.54% | 1.63
s641                              | 142  | 32  | 77.46% | 1.25
s1423                             | 422  | 216 | 48.81% | 17.31
s5378                             | 1033 | 581 | 43.75% | 181.4
s9234                             | 1042 | 466 | 55.27% | 93.29

* ASAP schedule is used.

Table III. Register Placement with Clock Skew: Result by Boyer et al. [2001b] vs. LP_skew

Circuit | # Registers Placed by Boyer et al. [2001b] | # Registers Placed Using LP_skew | Relative Gain | Run-Time (s)
s344    | 456  | 221  | 51.53% | 1.19
s641    | 487  | 388  | 20.32% | 1.35
s1423   | 2252 | 953  | 57.68% | 17.24
s5378   | 1449 | 1373 | 5.24%  | 163.08
s9234   | 3167 | 1431 | 54.81% | 82.28

reductions are also obtained using the LP. Indeed, as summarized in Table II, reductions as high as 77.46% are obtained in less than 181.4 s. The effectiveness of our approach with nonzero clock skew is also tested. We have experimented with the linear program LP_skew obtained as presented in Section 6, and compared the results with the method proposed in Boyer et al. [2001b]. As a target skew, we used δ = (P − Dist)/4, where P is the minimal clock period, Dist = max(P/5, max_{v∈V}(d(v))), and d(v) is the delay of the computational element v. Recall that, as stated in Boyer et al. [2001b], Dist = max_{v∈V}(d(v)) gives the maximum tolerance to clock skew. We have used P/5 in the expression Dist = max(P/5, max_{v∈V}(d(v))) since, if Dist = P/5, we would have δ = P/5, which is close to the value P/4 that, by Theorem 1, requires an infinite number of registers. Table III reports the results obtained using some circuits from the ISCAS89 benchmark suite [ISCAS'89 Benchmark suite 1996]. The first column gives the name of the circuit. Column 2 presents the number, Nb1, of registers placed by the method in Boyer et al. [2001b], using ALAP as a schedule. The number, Nb2, of registers placed by LP_skew is given in column 3. The fourth column gives the relative gain defined as ((Nb1 − Nb2)/Nb1) × 100%. The execution time for solving LP_skew is presented in the fifth column. As summarized in Table III, reductions of the number of required registers as high as 57.68% are obtained. Execution times for solving LP_skew range from 1.19 s to 163.08 s.


8. CONCLUSIONS

Data dependency constraints constitute a lower bound P on the clock period of synchronous sequential circuits. This lower bound can be determined by solving an instance of the well-known Cost-to-Time Ratio Cycle Problem on the graph modeling the circuit. To the best of our knowledge, methods based on basic retiming can fail to transform single-phase clocked sequential circuits into functionally equivalent clocked sequential circuits with clock period equal to P. The latter circuits are in general multiphase and can always be obtained by methods based on software pipelining techniques. These methods can be framed in the following four-step process: Step 1, determine P; Step 2, compute a valid periodic schedule of the computational elements; Step 3, place registers back in the circuit; Step 4, assign the clock signals to control the registers. In contrast to basic retiming, the registers are placed in the circuit independently of their locations in the original circuit. In this article, we have addressed the problem of how to simultaneously realize Steps 2 and 3 of the above process in order to minimize the number of registers. This was an open problem. We have proposed the first formulation in the literature to solve it. From this formulation, we devised a mixed integer linear program and a linear program to determine solutions to the problem. Furthermore, we have presented how the proposed approach can handle nonzero clock skew, which gives a solution to the problems left open in Boyer et al. [2001b]. Experimental results on well-known benchmarks confirmed the effectiveness of the approach. Indeed, significant reductions of the number of required registers have been obtained in very short run-times, although register sharing is not considered. When the schedule is given, the problem transforms into the problem of how to place the minimal number of registers. Our approach applies to this case as well. We plan to extend the proposed approach to a more general delay model including, for instance, the propagation delays of registers and wires. These delays are assumed to be zero in this article as in, for instance, Boyer et al. [2001a, b, c]. Propagation delays of registers and wires are considered in Soyata et al. [1995] and Soyata and Friedman [1994] for retiming single-phase clocked sequential circuits with edge-triggered flip-flops only. It seems that the approach in Soyata et al. [1995] and Soyata and Friedman [1994] cannot be applied to multiphase circuits derived as presented in this article. The problem of retiming under load-dependent delay constraints is addressed in Lalgudi and Papaefthymiou [1995], where polynomial-time methods are proposed for special cases of this problem. It is a challenging problem in its general form. Since the problem we address is similar to the problem of pipelining a cyclic design, we believe that the problem of using a general delay model does not take the same form as for methods based on basic retiming. We are working on extending the proposed approach to deal with the case where the propagation delays of computational elements are not constant. At this moment, software pipelining techniques are used for minimizing the clock period of circuits performing regular processing only. We plan to extend


these techniques to circuits that contain conditional paths such as if-then-else branches.

ACKNOWLEDGMENTS

The authors would like to thank the four anonymous reviewers and the Associate Editor for their valuable comments, from which this article benefited.

REFERENCES

AHUJA, R.-K., MAGNANTI, T.-L., AND ORLIN, J.-B. 1993. Network Flows: Theory, Algorithms, and Applications. Prentice Hall, Englewood Cliffs, NJ.
ALLAN, V., JONES, R. B., LEE, R. M., AND ALLAN, S. J. 1995. Software pipelining. ACM Comput. Surv. 27, 3.
BENNOUR, I.-E. 1996. Estimation de la performance et méthodes d'allocation dans la synthèse de systèmes numériques. Thèse de doctorat. DIRO, Université de Montréal.
BENNOUR, I.-E. AND ABOULHAMID, E.-M. 1995. Les problèmes d'ordonnancement cycliques dans la synthèse de systèmes numériques. Technical Report 996 (Oct.). DIRO, Université de Montréal. http://www.iro.umontreal.ca/~aboulham/pipeline.pdf.
BOYER, F.-R., ABOULHAMID, E.-M., SAVARIA, Y., AND BOYER, M. 2001a. Optimal design of synchronous circuits using software pipelining techniques. ACM Trans. Design Autom. Electr. Syst. 6, 4, 516–532.
BOYER, F.-R., ABOULHAMID, E.-M., AND SAVARIA, Y. 2001b. Minimizing sensitivity to clock skew variations using level sensitive latches. In Proceedings of the European Conference on Circuit Theory and Design (Aug.). Espoo, Finland.
BOYER, F.-R., ABOULHAMID, E.-M., AND SAVARIA, Y. 2001c. An efficient verification method for a class of multi-phase sequential circuits. In Proceedings of the 7th IEEE International Conference on Electronics, Circuits and Systems. Lebanon.
CHABINI, N., ABOULHAMID, E.-M., AND SAVARIA, Y. 2001. Reducing register and phase requirements for synchronous circuits derived using software pipelining techniques. In Proceedings of the IEEE Computer Society Workshop on VLSI (April). Orlando, FL.
CORMEN, T. H., LEISERSON, C. E., AND RIVEST, R. L. 1990. Introduction to Algorithms. McGraw-Hill, New York, NY.
DASDAN, A. AND GUPTA, R. K. 1998. Faster maximum and minimum mean cycle algorithms for system performance analysis. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. 17, 10 (Oct.).
DEOKAR, R. B. AND SAPATNEKAR, S. S. 1995. A fresh look at retiming via clock skew optimization. In Proceedings of the ACM/IEEE Design Automation Conference, 304–309.
FISHBURN, J. P. 1990. Clock skew optimization. IEEE Trans. Comput. 39, 945–951.
FRIEDMAN, E. G., LIU, X., AND PAPAEFTHYMIOU, M. C. 1999. Minimizing sensitivity to delay variations in high-performance synchronous circuits. In Proceedings of Design, Automation and Test in Europe, 643–649.
GEREZ, S.-H., DE GROOT, S.-M.-H., AND HERRMANN, O.-E. 1992. A polynomial-time algorithm for the computation of the iteration-period bound in recursive data-flow graphs. IEEE Trans. Circuits Syst. 39, 1.
GOLDBERG, A.-V. AND TARJAN, R.-E. 1987. Solving minimum-cost flow problems by successive approximation. In Proceedings of the 19th Annual ACM Symposium on Theory of Computing (May). New York, NY.
ISCAS'89 BENCHMARK SUITE. 1996. Department of Computer Science, North Carolina State University. Available at http://www.cbl.ncsu.edu/benchmarks/.
ISHII, A.-T., LEISERSON, C.-E., AND PAPAEFTHYMIOU, M. C. 1997. Optimizing two-phase, level-clocked circuitry. J. ACM 44, 1 (Jan.).
KARMAKAR, N. 1984. A new polynomial-time algorithm for linear programming. Combinatorica 4.
KHACHIAN, L.-G. 1979. A polynomial algorithm in linear programming. Soviet Math. Doklady 20.


KUNG, S. Y., WHITEHOUSE, H. J., AND KAILATH, T. 1985. VLSI and Modern Signal Processing. Prentice-Hall, Englewood Cliffs, NJ, 259–260.
LALGUDI, K. N. AND PAPAEFTHYMIOU, M. C. 1995. Efficient retiming under a general delay model. In Advanced Research in VLSI: Proceedings of the 1995 Chapel Hill Conference (March).
LAWLER, E.-L. 1976. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart and Winston, New York, NY.
LEGL, C., VANBEKBERGEN, P., AND WANG, A. 1997. Retiming of edge-triggered circuits with multiple clocks and load enables. In IEEE/ACM International Workshop on Logic Synthesis (IWLS'97).
LEISERSON, C. E. AND SAXE, J. B. 1991. Retiming synchronous circuitry. Algorithmica 6, 5–35.
LI, Y.-M. AND JABORI, M.-A. 1992. A zero-skew clock routing scheme for VLSI circuits. In Proceedings of the International Conference on Computer-Aided Design.
LIU, X., PAPAEFTHYMIOU, M. C., AND FRIEDMAN, E. G. 2002. Retiming and clock scheduling for digital circuit optimization. IEEE Trans. Comput.-Aided Design 21, 2, 184–203.
LOCKYEAR, B. AND EBELING, C. 1994. Optimal retiming of level-clocked circuits using symmetric clock schedules. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. 13 (Sept.), 1097–1109.
THE LP_SOLVE TOOL. Available at ftp://ftp.ics.ele.tue.nl/pub/lp_solve/.
MAHESHWARI, N. AND SAPATNEKAR, S. S. 1997. An improved algorithm for minimum area retiming. In Proceedings of the ACM/IEEE Design Automation Conference, 2–6.
MAHESHWARI, N. AND SAPATNEKAR, S. S. 1999. Optimizing large multiphase level-clocked circuits. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. 18, 9 (Sept.), 1249–1264.
SAPATNEKAR, S. S. AND DEOKAR, R. B. 1996. Utilizing the retiming-skew equivalence in a practical algorithm for retiming large circuits. IEEE Trans. Comput.-Aided Design 15, 10 (Oct.), 1237–1248.
SHENOY, N. AND RUDELL, R. 1994. Efficient implementation of retiming. In IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA.
SOYATA, T., FRIEDMAN, E. G., AND MULLIGAN, J. H. 1995. Monotonicity constraints on path delays for efficient retiming with localized clock skew and variable register delay. In Proceedings of the IEEE International Symposium on Circuits and Systems (May), 1748–1751.
SOYATA, T. AND FRIEDMAN, E. G. 1994. Retiming with non-zero clock skew, variable register, and interconnect delay. In Proceedings of the IEEE International Conference on Computer-Aided Design (Nov.), 234–241.
TSAY, R.-S. 1993. An exact zero-skew clock routing algorithm. IEEE Trans. Comput.-Aided Design 12 (Feb.), 242–249.

Received January 2003; revised March 2003, October 2003; accepted October 2003


Synthesis of Skewed Logic Circuits

AIQUN CAO, Synopsys, Inc.; NARAN SIRISANTANA, Intel Corporation; and CHENG-KOK KOH and KAUSHIK ROY, Purdue University

Skewed logic circuits belong to a noise-tolerant high-performance static circuit family. Skewed logic circuits can achieve performance comparable to that of Domino logic circuits but with much lower power consumption. Two factors contribute to the reduction in power. First, by exploiting the static nature of skewed logic circuits, we can alleviate the cost of logic duplication which is typically required to overcome the logic reconvergence problem in both Domino logic and skewed logic circuits. Second, a selective clocking scheme can be applied to a skewed logic circuit to reduce the clock load and hence, clock power. In this article, we propose a two-step synthesis scheme of skewed logic circuits. In the first step, an integer linear programming-based approach is presented to overcome the logic reconvergence problem in skewed logic circuits with minimal logic duplication cost. In the second step, a dynamic programming-based heuristic is applied to achieve an optimal selective clocking scheme. Experimental results show that the average power saving of skewed logic circuits over Domino logic circuits is 41.1%. Categories and Subject Descriptors: B.6.3 [Logic Design]: Design Aids—Automatic synthesis, Optimization; B.7.1 [Integrated Circuits]: Types and Design Styles—Advanced technologies, Standard cells, VLSI General Terms: Algorithms, Design Additional Key Words and Phrases: Skewed logic, synthesis, optimization, power

This work was supported in part by NSF (CCR-9984553), and SRC Hewlett-Packard Research Fellowship. Preliminary versions of this article appeared in the Proceedings of the International Symposium on Low Power Electronics and Design, 2001 [Sirisantana et al. 2001], the Proceedings of the International Symposium on Quality Electronic Design, 2002 [Cao et al. 2002], and the Proceedings of Asia South Pacific Design Automation Conference, 2003 [Cao et al. 2003]. Authors’ addresses: A. Cao, Synopsys, Inc., 700 East Middlefield Rd., Mountain View, CA 94043; N. Sirisantana, Intel Corporation, M/S RA3-256, 2501 NW 229th Ave., Hillsboro, OR 97124; C.-K. Koh, and K. Roy, School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907-1285. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or [email protected].  C 2005 ACM 1084-4309/05/0400-0205 $5.00 ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005, Pages 205–228.


1. INTRODUCTION

Domino logic has been a popular choice for designs that demand very high performance [Krambeck et al. 1982; Goncalves and Man 1983]. However, two inherent drawbacks of Domino logic limit its further use as technology continues to scale. First, the noise margin of Domino logic circuits is very small because it depends on the threshold voltages of transistors. That makes Domino logic circuits extremely susceptible to failures due to threshold voltage variation and noise injection. Moreover, threshold voltage scaling increases the subthreshold current exponentially, making Domino logic even more susceptible to noise. In other words, Domino logic circuits do not scale well with the technology. Second, Domino logic dissipates much more power than static circuits; it does not suit low power operation well. To overcome the drawbacks of Domino logic, a new noise-immune high-performance logic style, called skewed logic [Somasekhar 1999; Solomatnikov et al. 2000] or Monotonic Static (MS) CMOS logic [Thorp et al. 1999a, b], has been proposed. Skewed logic circuits are fully complementary static CMOS logic, with the size of the pull-down network (PDN) decreased and that of the pull-up network (PUN) increased, or vice versa, for a fast low-to-high or a fast high-to-low transition, respectively. Sizing the PDN and PUN to favor one transition direction is referred to as skewing [Somasekhar 1999; Solomatnikov et al. 2000]. Similar to Domino logic, skewed logic is operated in a precharging-evaluation fashion for high performance. The fast transition is used for evaluation, while the slow transition is used for precharging. Thus, skewed logic is comparable in speed to Domino logic. At the same time, skewed logic has better noise immunity than Domino logic due to its static nature. However, skewed logic suffers from the same logic reconvergence problem that plagues Domino logic. In Thorp et al. [1999a], the synthesis of skewed logic is performed by employing logic duplication to transform the network into a unate representation, as in Domino logic synthesis. Moreover, each gate is clocked for precharging-evaluation operations, as in Domino logic. Consequently, the skewed logic synthesized in Thorp et al. [1999a] does not have a power advantage over Domino logic. In this article, we show that by exploiting the static nature of skewed logic circuits, we can achieve a substantial reduction in the amount of logic that must be duplicated to overcome the logic reconvergence problem. Specifically, we propose a two-step synthesis scheme for skewed logic circuits with the objective of minimizing total power consumption. In the first step, an integer linear programming-based approach is presented to overcome the logic reconvergence problem in skewed logic circuits with minimal logic duplication cost. In the second step, a dynamic programming-based heuristic is applied to achieve an optimal selective clocking scheme to reduce the clocking power. Experimental results show that the power saving of skewed logic circuits over Domino logic circuits is 41.1% on average. The rest of the article is organized as follows. Skewed logic circuits are described in Section 2. Section 3 formulates the synthesis problem of skewed logic circuits. The two-step synthesis scheme is presented in Sections 4


Fig. 1. Two-input (a) NAND, (b) NOR, and (c) clocked NAND skewed logic gates.

and 5. Section 6 provides experimental results, and Section 7 concludes the article.

2. SKEWED LOGIC CIRCUITS

2.1 Topology

A skewed logic gate has the same circuit topology as a classical static CMOS gate, except that the sizes of PMOS and NMOS transistors are properly adjusted such that the gate favors a particular transition direction. Changing the ratio between PMOS and NMOS transistors of the gate from its original value R = W_p/W_n (W_p and W_n being the channel widths of the PMOS and NMOS transistors, respectively) to a new value R′ = W′_p/W′_n to favor one transition direction is referred to as skewing. R′/R is referred to as the skew value of the gate. Note that the overall gate size (transistor width) and gate capacitance are invariant under skewing [Somasekhar 1999]. The structures of a skewed-down NAND gate (for fast high-to-low transition) and a skewed-up NOR gate (for fast low-to-high transition) are shown in Figures 1(a) and (b), respectively.

2.2 Operation

Operating as precharging-evaluation circuitry, skewed logic circuits can achieve performance comparable to that of Domino logic circuits. A skewed gate is precharged to the logic value that allows only fast transitions in the evaluation phase. For fast evaluation, skewed-down gates are followed by skewed-up gates, and vice versa. Precharging can be accomplished either by clocked skewed logic gates, which precharge just like Domino gates (see Figure 1(c)), or by the propagation of precharged logic values through the logic chain originating from a clocked gate. For the second option, it is important that the precharging of skewed logic gates between two clocked gates does not exceed the precharge phase of the clock period. Thorp et al. [1999a] considered clocking of all skewed gates, whereas Sirisantana et al. [2001] and Solomatnikov et al. [2000] explored the second option by clocking skewed gates selectively (see Figure 2). Precharging through propagation requires a smaller number of clocked gates. It leads to a lower clock load and hence potentially lower clock power consumption than Domino circuits. Determining the selective clocking of skewed gates for low

208



A. Cao et al.

Fig. 2. A chain of skewed logic gates between two latches. The skew direction is indicated by the arrow.

Fig. 3. Logic reconvergence problem in (a) a Domino logic circuit, (b) a skewed logic circuit.

power operation is an integral step of the synthesis of skewed logic circuits. We will present a dynamic-programming based technique that determines the appropriate skew values and the optimal clocking scheme in Section 5. 2.3 Logic Reconvergence Problem To ensure fast evaluation, alternating skew directions should be assigned to successive logic gates, that is, skewed-down gates are followed by skewed-up gates and vice versa. Consequently, skewed logic encounters the same problem as Domino logic: logic reconvergence. The conventional synthesis paradigm for Domino logic requires that a unate representation from a logic network be generated. That is accomplished by pushing inverters to the primary inputs. With the presence of reconvergent paths in the logic network, an inverter may be trapped in the reconvergent paths as shown in Figure 3(a). Similarly, the reconvergent paths in a skewed logic network can render the assignment of alternating skew directions impossible, as shown in Figure 3(b). For the synthesis of Domino logic, this problem can be overcome by duplicating the fan-in cone of the reconvergent paths [Prasad et al. 1997], i.e., transforming a binate logic network to a unate one. Thorp et al. [1999a] applied this method for synthesizing skewed logic circuits. This approach will at most double the size of the circuit [Reddy 1973]. The duplication cost can be reduced by proper output phase assignment [Puri et al. 1996; Zhao and Sapatnekar 2000] but the achievable reduction is limited. Both the logic duplication technique and the output phase assignment technique for Domino logic synthesis can be applied to the synthesis of skewed logic. ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005.

Synthesis of Skewed Logic Circuits



209

Fig. 4. (a) Reconvergent paths with a NAND gate. (b) Nonalternating skew directions.

Sirisantana et al. [2001] made use of pass transistors to alleviate the logic reconvergence problem. However, it did not consider the glitch caused by the pass transistors which could affect the performance of circuits. In Cao et al. [2002], that was corrected and the static nature of skewed logic was exploited to reduce the logic duplication penalty. The key is that, under certain conditions, assigning nonalternating skew directions to skewed logic gates does not impede the performance of circuit. In other words, the reconvergence problem can be overcome by assigning nonalternating skew directions. In Cao et al. [2002], each pair of reconvergent paths was solved independently; such an approach restricted the solution space since there might be many reconvergent paths overlapping with each other in real circuits. The solution space is further reduced by the restriction that only one gate is permitted to have nonalternating skew direction in each pair of reconvergent paths. In this article, we eliminate those two restrictions to achieve even more improvement on logic duplication and thus power dissipation. We consider the overlapping of reconvergent paths and allow more than one gate to have nonalternating skew direction for each pair of reconvergent paths to overcome the logic reconvergence problem. 2.4 Nonalternating Skew Directions A careful analysis of the static skewed logic circuit reveals that we can make use of the NAND gate B in Figure 4(a) to avoid logic duplication as follows [Cao et al. 2002; Kim et al. 2000]: we make both gate B and gate D skewed-down gates. The other fan-in to gate B, gate A, and the fan-out of gate B, gate F , are both skewed up, as shown in Figure 4(b). The skew directions of gates D and B are nonalternating. Although gates D and B are skewed in the same direction, gate B still properly precharges (to logic 1) because of gate A, which precharges to logic 0 as shown in Figure 5 (The precharge value is shown after each gate in the figure). In other words, we make use of the fact that gate B is a static CMOS circuit to accommodate nonalternating skew directions. As skewed logic achieves its performance by allowing only fast transitions during the evaluation phase, such a scheme requires that if gate D switches, it switches earlier than gate A. Otherwise, a glitch will appear at the output of gate B, and that will affect the performance of the circuit. ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005.

210



A. Cao et al.

Fig. 5. Transition of a NAND gate.

Fig. 6. Skew directions for (a) candidate NOR gate, (b) candidate NAND gate.

Note that for any NAND gate that precedes a skewed-up gate, we can always skew down the NAND gate and its faster fan-in gate as in Figure 5 so that only fast transitions occur during the evaluation phase. When the NAND gate has more than two fan-in gates, nonalternating skew directions can be assigned as long as the skewed-down fan-in gates switch earlier than the fan-in gates that are skewed up if they do switch. A similar analysis will reveal that the skew direction assignment shown in Figure 6(a) with the NOR gate as the middle gate also achieves the same effect. Similar configurations can also be obtained for AND and OR gates. We refer to the gates that allow nonalternating skew directions as candidate gates. The key limitation here is that when nonalternating skew directions are considered, the only skew direction feasible for the fan-out gate of a NAND (NOR) gate is skewed up (down). Consider the case of a NOR gate. If the fanout gate is skewed-up, it has to precharge to logic 0, which indicates that the NOR gate has to precharge to logic 1, and therefore all fan-ins to the NOR gate have to precharge to logic 0. It is obvious that the nonalternating skew direction scheme cannot eliminate all of the logic reconvergence problem in any circuit due to such a limitation. Logic duplication is still needed for the unresolved logic reconvergence problem. In Section 4, we formulate the problem of skew direction assignment for the minimization of logic duplication cost as an integer linear program. ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005.

Synthesis of Skewed Logic Circuits



211

3. SYNTHESIS PROBLEM OF SKEWED LOGIC CIRCUIT The skewed logic circuit synthesis problem can be stated as follows: Given a mapped logic network, implement it as skewed logic by determining the skew direction and skew value of each gate, and an optimal clocking scheme, such that the timing constraint is satisfied and the total power consumption is minimized. Here, we assume that the generic cell library would contain NAND, NOR, and inverters for mapping the logic network. Unfortunately, it is extremely difficult to determine the skew direction and value of each gate at the same time. The reason is that the network topology may change due to logic duplication when we assign skew directions. Without a fixed network topology, it is difficult to find the skew value for each gate as the loads of the gates are not fixed. In this article, we apply a twostep approach to the synthesis of skewed logic, that is, the synthesis problem is divided into two subproblems: skew direction assignment followed by skew value and clocking optimization. The problem of skew direction assignment is that of determining the skew direction of each gate such that the logic reconvergence problem is overcome with minimal logic duplication. The problem of skew value and clocking optimization is that of determining the skew value of each gate and an optimal clocking scheme such that the given timing constraint is satisfied, and the total circuit power dissipation is minimized. Although the two-step approach may degrade the overall quality of the solution, experimental results indicate that significant power reductions can be achieved. We present the details of the solutions to the two subproblems in the next two sections. 4. SKEW DIRECTION ASSIGNMENT WITH INTEGER LINEAR PROGRAM (ILP)-BASED APPROACH We state the problem of skew direction assignment as follows: Given a logic network, determine the skew direction of each gate by using both alternating and nonalternating schemes such that the amount of logic duplication is minimized. Note that the output phase assignment problem in Puri et al. [1996] and Zhao and Sapatnekar [2000], which is NP-hard, is a special case of this problem. In this section, we transform the alternating and nonalternating schemes into integer linear constraints among the edges of the network so that an edgebased ILP formulation is generated. The ILP is solved with the objective of minimizing the logic duplication cost under those constraints. A simplification technique to reduce the number of constraints is presented. Afterwards, we present heuristics to solve the ILP efficiently. Methods to further reduce the logic duplication cost are discussed. 4.1 Definitions Given any pair of reconvergent paths, we refer to the node from which the reconvergent paths depart as the divergent node, and the node at which the paths converge, the convergent node. For simplicity, we say that a pair of reconvergent paths forms a reconvergent cycle, or simply a cycle, in the rest of this article, although there are no cyclic paths in the directed acyclic graph defined by the ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005.

212



A. Cao et al.

Fig. 7. Two petals.

logic network. We denote the cycle with divergent node A and convergent node Z as CAZ . We denote the set of edges in cycle C as E(C). Given two cycles C and C ′ , if E(C) ∩ E(C ′ ) = ∅, we say that C and C ′ belong to a petal [Dey et al. 1990]. Furthermore, given three cycles C, C ′ , and C ′′ , if E(C)∩ E(C ′ ) = ∅, and E(C)∩ E(C ′′ ) = ∅, C, C ′ , and C ′′ belong to the same petal. In other words, a petal consists of all the cycles that share edges among them. Figure 7 shows two petals, one formed by many cycles, and the other by a single cycle. The cycles in Figure 7 also illustrate the fact that some cycles are the union of two or more other cycles (after removing the shared edges in these cycles). For example, the union of CAE and CBL forms the cycle CAL (after removing the shared edge eBE ). We call CAL a composite cycle and CAE and CBL simple cycles, in that they are not the result of the union of other cycles. Given an edge e, we denote the head (destination) and tail (source) of the edge as He and Te , respectively. Edges with the same head are sibling edges. If He is a candidate gate, and Te is not the slowest among all the fan-ins of He , edge e is a candidate edge. 4.2 An Edge-Based Formulation Here, we use the assignment of skew directions to gates in a cycle to illustrate the proposed edge-based integer linear program (ILP) approach. Along the two paths of the cycle, there may be more than one candidate gate where nonalternating skew directions can be applied. In our formulation, candidate edges are (binary) variables in the ILP. Edge e is assigned value 0 if He is in the same skew direction as Te , and value 1 otherwise. All noncandidate edges implicitly have a value of 1. P is defined as the parity of an edge or a path. For an edge, its parity is the same as the value it is assigned; for a path, its parity is 0 if the sum of the values of all the edges along the path is an even number, and 1 if the sum is an odd number. There are three types of constraints that we have to capture in the edge-based ILP formulation. (i) Gate constraints originate from the definition of candidate gates: the nonalternating fan-ins of a candidate gate must be faster than the alternating ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005.

Synthesis of Skewed Logic Circuits



213

Fig. 8. Three feasible assignments of edge variables.

Fig. 9. Path constraints.

fan-ins. Therefore, when a candidate edge is determined to be of value 1, all the slower sibling candidate edges must also have value 1. Figure 8 shows an example of the gate constraint where e1 and e2 are two candidate edges feeding into a candidate gate, and e1 is faster than e2 . Three possible cases for the value of two candidate edges e1 and e2 are enumerated in Figure 8(a)–(c). A single inequality can capture the gate constraint between e1 and e2 as follows: e2 − e1 ≥ 0. (ii) Path constraints stem from the fact that there is only one nonalternating skew direction for each candidate gate. For example, the only nonalternating skew direction for a candidate NOR gate is to skew up. Consider the simple example in Figure 9, where e1 and e2 are the only two candidate edges (dashed line) along a path. Gate A and D are two candidate gates. Gate A (D) can only be skewed-up (skewed-down) for the nonalternating scheme since it is a NOR (NAND) gate. However, if nonalternating skew directions are applied to both of them, that would result in a conflict as shown in Figure 9. In other words, e1 and e2 cannot both be assigned 0. That constraint is captured by the following inequality: e1 + e2 ≥ 1. Consider two successive candidate edges between which there are no other candidate edges. (However, there might be noncandidate edges between the two successive candidate edges.) In general, if the head gates of these two successive candidate edges are of different types (NAND and NOR), the number of noncandidate edges between them must be odd in order to avoid conflicts. If the candidate gates are of the same type (NAND or NOR), there should be an even number of noncandidate edges between them in order to avoid conflicts. Unfortunately, not only do conflicts exist between two successive candidate edges, they also exist among all candidate edges along one path. Consider three successive candidate edges, e1 , e2 , and e3 in Figure 10. The two constraints for ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005.

214



A. Cao et al.

Fig. 10. Conflicts among three successive candidate edges.

e1 and e2 , and e2 and e3 are: e1 + e2 ≥ 1, e2 + e3 ≥ 1,

(1) (2)

respectively. If e2 is assigned value 0, the constraints among the three candidate edges can be captured by the preceding two constraints. If e2 is of value 1, however, it acts like a noncandidate edge, and the circuit degenerates to the case in Figure 9. Therefore, we would require a new constraint for e1 and e3 : e1 + e3 ≥ 1 if e2 = 1,

(3)

e1 + e3 − e2 ≥ 0.

(4)

which is equivalent to So, there are three constraints in total for these three candidate edges. It turns out that the three inequality constraints among e1 , e2 , and e3 , Equations (1), (2), and (4) can be adequately captured by a single inequality constraint as follows: e1 + e2 + e3 ≥ 2,

(5)

since at most one of the three candidate edges can have value 0. Extending to n successive candidate edges along one path, there are O(n2 ) number of inequalities among them. However, they can be replaced and captured by one single inequality: en + en−1 + · · · + e1 ≥ n − 1, which implies that only one candidate edge among those n edges can have value 0. In general, given any three candidate edges (not necessarily successive) with head gates A, B, C (in that logic order), there are four possible scenarios: (1) A conflicts with B, B conflicts with C; hence, A conflicts with C. Here, A conflicts with B implies that A and B cannot both be assigned nonalternating skew directions. (2) A does not conflict with B, B does not conflict with C; hence, A conflicts with C. (3) A conflicts with B, B does not conflict with C; hence, A does not conflict with C. (4) A does not conflict with B, B conflicts with C; hence, A does not conflict with C. We have shown how the conflicts that arise in the first scenario can be efficiently captured by one single inequality constraint. In fact, given any path ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005.

Synthesis of Skewed Logic Circuits



215

in a simple cycle, one can always partition the candidate edges into at most two groups with edges in a group conflicting with each other but not with edges in the other group. Let j be the number of pair-wise conflicting candidate edges in a group. We further partition the group into k subgroups such that every subgroup contains only successive candidate edges. Based on the simplification technique presented earlier, the conflicts within every subgroup can be captured by one inequality constraint. The number of additional inequality constraints to capture the relations between these k subgroups is  k  k(k−1) . We provide the proof in the Appendix. Without the simplification, = 2 2 the number of inequality constraints among these j candidate edges would be j 2 . The experiments conducted in this article show that k is about a third of j . (iii) Cycle constraints concern candidate edges on two different paths in a cycle. In general, we can generate cycle constraints by treating them as path constraints. We define a path, via the divergent node, between the two candidate edges in the cycle. Now, we can express the cycle constraints as the path constraints, and similar simplification technique applies. Moreover, we have another constraint: the original two paths of the cycle must have the same parity, that is, the number of alternating skew directions along the two paths must satisfy: e1 + · · · + em + c − e1′ − · · · − en′ − c′ = 2i,

(6)

where c − c′ − n ≤ 2i ≤ c − c′ + m, and i is integer. Here, e j is the candidate edge along one reconvergent path in a cycle, and ek′ is the candidate edge along the other reconvergent path in the same cycle. Constants c and c′ are the numbers of noncandidate edges along the two paths, respectively. i is an integer variable to be solved in the ILP. Equation (6) arises from the logic reconvergence problem that we try to solve, whereas other constraints arise from the limitation of the nonalternating skew direction scheme. The above three types of constraints are applicable to both simple cycles and composite cycles. Fortunately, we have to consider only simple cycles. In the following, we argue that the reconvergence problems of all composite cycles are solved when those of simple cycles are solved. The reconvergence problem of a cycle is solved if the two paths of the cycle have the same parity. Now we apply induction on the number of simple cycles that a composite cycle contains. We can prove that the two paths of the composite cycle have the same parity if that is true for all the simple cycles that it contains. Take the cycles CAE and CBL in Figure 7, for example. Once the constraints specified by cycles CAE and CBL are satisfied, the reconvergence problems of them are solved directly. In other words, the paths A − C − D − E and A − B − E, B − E − K − L and B − J − L have the same parity: PACDE = PABE , PBEKL = PBJL . ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005.

216



A. Cao et al.

For the two paths A − B − J − L and A − C − D − E − K − L of the composite cycle CAL , we have: PABJL = PABE + PBJL − PBE = PACDE + PBEKL − PBE = PACDEKL , which means that the reconvergence problem of the composite cycle AL is solved as well. 4.3 Skew Direction Assignment First, we consider the skew direction assignment within a petal. Clearly, we can formulate the skew direction assignment problem as an ILP in which the variables are the candidate edges, and the constraints are the gate, path, and cycle constraints formed by simple cycles within the petal. However, there is one compelling reason to abandon the preceding suggested approach. The suggested approach will produce a “YES/NO” answer, of which the “NO” answer is of no value as it does not provide a partial result to our synthesis problem. Rather, we would like to identify in a petal as many cycles as possible whose reconvergence problems can be resolved by nonalternating skew direction assignment. The other cycles would have their fan-in cones duplicated to overcome the logic reconvergence problems. To achieve this, we use the following heuristic that incrementally solves a one-cycle ILP formulation at a time. Suppose we already have a feasible solution for n cycles in the petal. We use the assignment specified in the feasible solution to determine the constraints for the (n+1)th cycle. For example, if the n-cycle solution assigns value 1 to a candidate edge in the (n + 1)th cycle then this candidate edge becomes a noncandidate edge in the ILP formulation for the (n + 1)th cycle. Otherwise, it imposes constraints on the problem of skew direction assignment of the (n+1)th cycle even though it is no longer a variable. If the ILP formulation for the (n + 1)th cycle produces a feasible solution, we proceed to the (n + 2)th cycle. Otherwise, we solve a large ILP that considers all (n + 1) cycles simultaneously. Here, the constraints of the large ILP are the union of the constraints of those (n + 1) cycles. If the ILP produces a feasible solution, we proceed to the (n + 2)th cycle. Otherwise, we duplicate the fan-in cone of the (n + 1)th cycle and proceed to the (n + 2)th cycle. Note that when the fan-in cone of the (n + 1)th cycle is duplicated, all reconvergence problems in the fan-in cone are also resolved. In the incremental ILP formulation, we optimize the following objective in each ILP:  min ci × ei , i

where ei is the candidate edge, and ci is the number of different simple cycles sharing the candidate edge ei . The idea here is that we want to reduce the number of constraints in the ILP of later cycles. The objective function maximizes the number of candidate edges assigned with value 1 (i.e., alternating skew directions), especially for candidate edges shared by many cycles. Now, we consider the assignment of skew directions to the gates not in any petals. Some of these gates may lie along the path connecting two petals, as ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005.

Synthesis of Skewed Logic Circuits



217

Fig. 11. Path connecting 2 petals.

shown in Figure 11. If alternating skew direction assignment along such a path results in conflicts, it is easy to find a candidate edge along this path to which we can apply nonalternating skew directions. If such a candidate edge does not exist, it is necessary to identify appropriate fan-in cones (with the least overall duplication cost) to be duplicated. 4.4 Logic Duplication As described in the previous section, logic duplication is still needed when the ILP-based approach cannot solve all of the logic reconvergence problem or no candidate edges exist along a path connecting two petals. Normally, when a node needs to be duplicated, the nodes within its whole fan-in cone need to be duplicated. Figure 12(a) shows a divergent node D and its fan-in cone to be duplicated. The structure after duplication is shown in Figure 12(b). However, we can still reduce the duplication cost even when duplication of the fan-in cone is unavoidable. The reason is that the nonalternating skew direction scheme can be applied to the duplicated fan-in cone, too. In Figure 12(c), for example, we do not have to duplicate node W and its fan-in cone when node Z ′ (duplicate of node Z ) is a candidate gate, that is, node Z ′ allows nonalternating fan-in. When a fan-in cone of node Y is duplicated, the duplication is performed in the direction opposite to that of the original DAG until the primary inputs are reached. The duplication is performed with a breadth-first traversal that starts from node Y . The duplicate is assigned to a skew direction opposite to the original, both of which are determined according to the skew directions of their own fan-out gates. Once a candidate gate X is met during the duplication process, the nonalternating scheme is applied, and no duplication is required for the candidate gate X and its fan-in cone. The skew directions of the gates in the fan-in cone of X can be determined accordingly. 5. SELECTIVE CLOCKING SCHEME Now that we have determined the skew direction of each gate, the next step is to determine the skew value, that is, the ratio of the PMOS and NMOS transistor sizes, of each gate. We assume that a library of skewed logic gates are given. Every logic gate has a few implementations corresponding to different skew values in the library. However, the overall gate size and gate capacitance are identical for different implementations of a logic gate. Different skew values result in different precharge and evaluation delays, denoted respectively by tpc and tev , and thus, different clocking schemes. We apply a dynamic programming-based heuristic algorithm to find the clocking solution. The goal is to find the skew ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005.

218



A. Cao et al.

Fig. 12. Applying the nonalternating scheme when duplicating.

values of the gates such that the total number of clocked gates is minimized, while satisfying the precharge and evaluation time constraints. First, we consider the special case when the netlist has a tree structure with a primary output as a root. Second, we consider the case when the netlist is a directed acyclic graph. Finally, further reduction of clocked gates of the selective clocking scheme is discussed. 5.1 Netlist With a Tree Structure With a tree structure, it is natural to apply a dynamic programming-based approach to assign skew values to skewed gates and to determine the location of clocked gates. We apply the dynamic programming approach in a bottom-up fashion, that is, from the first-level logic gates, which should be clocked (as shown in Figure 13). Before we present the algorithm, we first define precharge slack and evaluation slack. Consider a skewed gate N . Let CFI be the set of clocked gates that ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005.

Synthesis of Skewed Logic Circuits



219

Fig. 13. Tree with leaf nodes connected to clock.

are in the fan-in cone of N (including N ) such that there exist no clocked gates along the paths from N ′ ∈ CFI to N . The precharge propagation delay is the sum of the longest precharge delay from gates in CFI to N and the precharge delay of N . The precharge slack at N is obtained by subtracting the precharge propagation delay from the given duration of the precharge period. The evaluation propagation delay is the sum of the longest evaluation delay from the primary inputs to N , and the evaluation delay of N . The evaluation slack at N is obtained by subtracting the evaluation propagation delay from the given duration of the evaluation period. Given a tree T , we associate each node in T with a list of triples: {( p1 , e1 , 0), . . . , ( pk , ek , 0), ( pk+1 , ek+1 , 1), . . . , ( pm , em , 1)}. For the triple ( pi , ei , bi ), pi is the precharge slack, ei is the evaluation slack, and bi is a binary value: “0” means that the gate is not clocked, and “1” means that the gate is clocked. We construct the list in a bottom-up fashion as follows. Let N be a leaf node in T . Let Tpc and Tev be the given durations of precharge and evaluation periods, respectively. For every skew value of N , we create a tuple ( p, e, 1), where p = Tpc − tpc , where tpc is the precharge delay of N , and e = Tev − tev , where tev is the evaluation delay of N . The third term in the triple is always “1,” as N is a primary input gate. Consider an internal node N in T . We construct the list of triples of N from its child nodes. For simplicity, we assume that N has two child nodes with the following two lists of triples: ′ ′ ′ {( p1′ , e1′ , b1′ ), ( p2′ , e2′ , b2′ ), . . . , ( pm , em , bm )},

{( p1′′ , e1′′ , b1′′ ), ( p2′′ , e2′′ , b2′′ ), . . . , ( pn′′ , en′′ , bn′′ )}. For a particular skew value of N , let tpc and tev be the precharge and evalu′ ′ be the precharge and evaluaand tev ation delay if N is not clocked, and let tpc tion delay if N is clocked. Therefore, the two triples obtained when we combine ( pi′ , ei′ , bi′ ) with ( p′′j , e′′j , b′′j ) are: (min( pi′ , p′′j ) − tpc , min(ei′ , e′′j ) − tev , 0), (Tpc −

′ tpc ,

min(ei′ , e′′j )



and ′ tev ,

1).

ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005.

220



A. Cao et al.

Fig. 14. Deriving precharge and evaluation slacks of shared nodes.

There are in total O(m ∗ n ∗ s) combinations for N , where s is the total number of skew value choices for N , m and n are the number of triples of the child nodes of N . As a result, there are an exponential number of combinations for the entire tree. Fortunately, various pruning criteria can be deployed to simplify the search process. Whenever a triple has a negative slack, for example, it constitutes an infeasible solution and should be discarded. Moreover, consider two triples of a node: ( pi , ei , bi ) and ( p j , e j , b j ). We observe that if bi = b j , pi ≤ p j , and ei ≤ e j , then ( pi , ei , bi ) is an inferior solution; it can be eliminated from the list. Not only can the total number of combinations for N be reduced to O((m + n) ∗ s), the run-time of the algorithm can also be reduced accordingly [Stockmeyer 1983]. After traversing the whole tree, we select among the triples at the root node one that requires the least number of clock number. The skew values for the remaining nodes can be obtained by a top-down traversal of the tree. 5.2 General Logic Network When the logic network is a directed acyclic graph, it is very difficult to apply dynamic programming to find appropriate skew values and a clocking scheme; for example, the sizing of Domino gates [Zhao and Sapatnekar 1998]. A common technique, which we adopt here as well, is to first decompose the netlist into trees. Several ways of decomposing a netlist into trees are available. In this article, we first compute the logic depth of all primary outputs. We find the fan-in cone of the primary output with the largest logic depth and use that to define a tree rooted at the primary output. The leaf nodes of the tree are all primary inputs. Therefore, the evaluation time of the tree is set to be the given evaluation period Tev , and the dynamic programming-based approach described in the previous section is applied. We remove the nodes in the tree from the netlist and proceed to extract trees from the remaining netlist in a similar fashion. If the root of the extracted tree is a primary output and the leaf nodes are all primary inputs, the evaluation time of the tree is also set to be Tev . If the root and/or leaf nodes of the extracted tree are shared by previously extracted trees, we derive the new precharge and evaluation slacks of those nodes from the results of the previously extracted trees. For example, Figure 14(a) shows the precharge slack p and evaluation slack e of shared nodes ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005.

Synthesis of Skewed Logic Circuits



221

Fig. 15. Eliminating the last stage clock.

according to previously extracted trees, while the new p and e of those nodes are shown in Figure 14(b). From Figure 14, we can see that the precharge slacks are kept the same, whereas we have to deduct the evaluation slack of the root from the evaluation slacks of those nodes. The dynamic programming-based approach is applied to the extracted tree and the procedures repeat until all nodes have been considered. 5.3 Further Reduction of Clocked Gates If we assume a 50-50 duty cycle for the clock, there is a further step to cut down the number of gates connected to the clock. After obtaining the clocking scheme as outlined in the previous section, we can always eliminate the clock connected to the gates in the CFI of the primary outputs. As Figure 15 shows, the clock connected to gate 7 in Figure 2 can be deleted without affecting the correct operation of the circuit. After removing the clock, if gates 7, 8, and 9 finally keep the precharge value, they still have the entire duration of the evaluation period to perform the precharge properly. Otherwise, they would still evaluate correctly. If the duty cycles are uneven, only some clock gates in the CFI of primary outputs may be unclocked. 6. EXPERIMENTAL RESULTS We have implemented the two-step synthesis scheme presented in Sections 4 and 5 in C++ language. We use a standard ILP package “lp solve” to solve the ILPs. Ten ISCAS benchmark circuits are used for experiments running on Sun Ultra10, 440 MHz, 512M RAM. The solving of ILPs is dominating in run-time. For each benchmark circuit, the total number of petals, the total number of cycles, the average number of cycles in a petal, the maximum number of cycles in a petal, the total number of integer variables, the average number of integer variables in a cycle, and the maximum number of integer variables in a cycle are listed in Table I. Using the simplification technique presented in Section 4.2, average reduction in the number of inequality constraints and the run-time are 2.8x and 5x, respectively (See Table I). For comparison, the ISCAS benchmark circuits are implemented as skewed logic circuits and Domino circuits. For skewed logic circuits, the library contains inverters, up to 4-input NAND and NOR gates, with six different skew values each (values 3, 4, and 5 for both skewing up and skewing down). By using SIS [Sentovich et al. 1992], the benchmark circuits are mapped to the generic library gates (NAND, NOR, and inverters) without determining skew directions or values first. After that, the two-step approach is applied. For Domino circuits, we restrict the library of gates to contain only inverters, 2-input NAND, and ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005.

222



A. Cao et al.

Table I. Comparison of ILP Solving Run-Time Before and After the Simplification of Constraints Circuit C432 C499 C880 C1355 C1908 C2670 C3540 C5315 C6288 C7552 Avg # petals 4 20 18 33 21 25 19 33 16 27 – # cycles 46 134 151 315 445 489 977 692 1272 1163 – avg # cycles 12 7 8 10 21 20 51 21 80 43 – per petal max # cycles in 27 69 62 98 134 110 292 248 411 355 – a petal # integer 192 358 311 492 620 667 1315 949 1780 1911 – variables avg # integer 6 7 6 7 7 6 9 7 8 8 – variables per cycle max # integer 11 21 16 20 27 29 59 34 55 68 – variables in a cycle # constraints (before 1084 3462 2007 4146 7539 7927 30402 14276 29751 38874 – simplification) ILP time(min) (before 16 52 48 92 180 198 1154 496 1255 1304 – simplification) # constraints (after 533 1625 1023 1792 2676 2354 8222 4678 14292 9425 – simplification) ILP time(min) (after 4 15 18 31 49 40 145 88 201 168 – simplification) Reduction of # 2x 2.1x 2x 2.3x 2.8x 3.4x 3.7x 3.1x 2.1x 4.1x 2.8x constraints Reduction of 4x 3.5x 2.7x 3x 3.7x 5x 8x 5.6x 6.2x 7.8x 5x time

up to 6-input NOR gates, all with minimum sizes. Note that logic duplication is used to resolve the issue of logic reconvergence. We synthesize the Domino logic circuits first and obtain the critical delay of each circuit. We use that as the delay constraint when the dynamic programming-based approach is used to determine the skew values of the corresponding skewed logic circuit. Therefore, the delays of the Domino logic circuit and the skewed logic circuit of each benchmark are almost identical. All experiments use 0.35 µm CMOS technology at a supply voltage of 3.3 V. The effective channel widths for PMOS and NMOS transistors when unskewed (or when skew value = 1) are 4.5 µm and 1.8 µm, respectively. Power and delay for each circuit are obtained by using PowerMill and PathMill, respectively. The results are summarized in Table II. In Table II, “SL” stands for skewed logic circuit synthesized using our method and “DL” for Domino logic circuit. From Table II, we observe that our skewed logic implementation significantly reduces the amount of logic duplication required when compared with that required by the Domino logic implementation (18.6% versus 68.0%). This contributes to substantial power savings. The average power saving of skewed logic over Domino is 41.1%. (We could have used the methods in Puri et al. [1996] and Zhao and Sapatnekar [2000] to reduce the duplication cost for Domino logic. Puri et al. [1996] and Zhao and Sapatnekar [2000] reported about 15% reduction in the duplication cost. (If the reduction rate were applicable to the benchmark circuits used in this experiment, the average duplication cost of Domino logic would reduce to about 55%). ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005.

Synthesis of Skewed Logic Circuits



223

Table II. Comparison Among Skewed Logic Synthesized Using Our Method (SL), Domino Logic (DL), Static CMOS (SC), and Skewed Logic Synthesized Using Method in Thorp et al. [1999a]

Circuit

Type SL C432 DL SC [Thorp et al. 1999a] SL C499 DL SC [Thorp et al. 1999a] SL C880 DL SC [Thorp et al. 1999a] SL C1355 DL SC [Thorp et al. 1999a] SL C1908 DL SC [Thorp et al. 1999a] SL C2670 DL SC [Thorp et al. 1999a] SL C3540 DL SC [Thorp et al. 1999a] SL C5315 DL SC [Thorp et al. 1999a] SL C6288 DL SC [Thorp et al. 1999a] SL C7552 DL SC [Thorp et al. 1999a] SL Average DL SC [Thorp et al. 1999a]

% Circuit Clock Total # Logic Power Power Power Delay Area Gates Duplication (mW) (mW) (mW) (ns) (x1000 µm2 ) 336 20.0 18.11 8.04 26.15 2.1 37.248 510 82.1 38.27 11.97 50.24 2.1 51.014 280 – 13.82 – 13.82 3.6 29.773 498 78 40.44 18.62 59.06 2.0 57.365 713 14.8 49.91 18.31 68.22 3.5 69.506 1097 76.7 102.28 33.96 136.24 3.6 90.384 621 – 44.08 – 44.08 5.3 56.014 1066 71.6 98.06 40.42 138.48 3.3 93.116 488 6.3 50.28 20.27 70.55 3.3 51.519 695 46.3 89.40 28.28 117.68 3.3 65.839 459 – 46.21 – 46.21 5.9 46.355 686 49.4 90.25 42.02 132.27 3.3 71.195 799 21.1 63.54 22.65 86.19 3.5 80.606 1126 70.6 101.11 33.89 135.00 3.7 98.228 660 – 52.99 – 52.99 6.5 63.480 1101 66.9 97.87 49.33 147.20 3.4 110.801 898 10.5 64.61 19.11 83.72 3.8 78.014 1416 69.0 109.33 32.05 141.38 3.7 105.860 812 – 57.64 – 57.64 6.2 67.128 1428 75.7 120.08 46.66 166.74 3.8 135.689 1385 16.3 89.05 36.37 125.42 4.5 89.533 1990 46.6 184.73 56.18 240.91 4.7 102.265 1191 – 75.34 – 75.34 7.8 74.708 1949 63.7 160.04 80.31 240.35 4.2 118.330 1794 14.2 119.90 41.24 161.14 4.8 129.124 2787 77.4 217.49 66.65 284.14 4.8 179.346 1571 – 102.92 – 102.92 8.7 107.785 2840 80.8 202.15 93.38 295.53 4.7 177.468 3069 10.9 149.01 56.26 205.27 5.1 187.073 4114 48.6 241.73 78.32 320.05 5.2 224.165 2767 – 131.75 – 131.75 9.0 161.277 4088 47.7 242.30 108.53 350.83 4.9 234.220 4257 58.4 118.76 41.84 160.60 8.3 370.057 5159 91.9 155.16 58.71 213.87 8.4 413.458 2688 – 88.66 – 88.66 13.9 228.429 5025 86.9 166.87 99.55 266.42 8.0 470.513 4288 13.7 243.84 110.47 354.31 5.8 231.741 6425 70.4 460.65 168.95 629.60 5.6 313.534 3772 – 189.09 – 189.09 8.8 194.742 6388 69.4 438.35 237.71 676.06 5.8 360.088 1803 18.6 96.70 37.46 134.16 4.47 132.442 2532 68.0 170.02 56.90 226.91 4.51 164.409 1482 – 80.25 – 80.25 7.57 102.969 2507 69.0 165.64 81.65 247.29 4.34 182.879

ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005.

224



A. Cao et al.

Fig. 16. Subgroups S1 , S2 and T1 .

In order to compare our work with previous work on skewed logic synthesis, we also implement the method in Thorp et al. [1999a]. In Thorp et al. [1999a], logic duplication is used to resolve the issue of logic reconvergenece. Moreover, every gate is clocked. When compared with the results of the method in Thorp et al. [1999a], the logic duplication of our method is 18.6% versus 69.0%, and the average power saving over Thorp et al. [1999a] is 45.7%. This is because the method in Thorp et al. [1999a] has similar duplication cost to Domino logic and even more clock power than Domino logic. However, its delay is slightly better since the gates on the critical path can have the largest available skew values due to the fact that each gate is clocked. We also synthesize the Static CMOS circuit for each benchmark with 0.35 µm CMOS technology by using Silicon Ensemble. The results for Static CMOS circuits are also included in Table II. In comparison, Static CMOS circuits have the smallest area and power dissipation, but the largest delay. 7. CONCLUSION In this article, we propose a two-step synthesis scheme for skewed logic circuits. An ILP-based approach overcomes the logic reconvergence problem in skewed logic circuits with minimal logic duplication cost. A dynamic programmingbased selective clocking scheme reduces the clock load. These two factors contribute to produce the reduction in total power consumption. APPENDIX THEOREM. Suppose a group of pair-wise conflicting candidate edges S can be divided into two subgroups, S1 , and S2 , with each subgroup consisting of successive candidate edges only: S1 = {si | 1 ≤ i ≤ m}, S2 = {s′j | 1 ≤ j ≤ n}. The candidate edges between subgroup S1 and S2 form subgroup T1 (Figure 16) of another group of pair-wise conflicting candidate edges T : T1 = {tk | 1 ≤ k ≤ l }. There is no conflict between S and T . The conflicts within subgroup S1 and S2 can be captured by one inequality constraint, respectively: s1 + s2 + · · · + sm ≥ m − 1,

(A-1)

s1′ + s2′ + · · · + sn′ ≥ n − 1,

(A-2)

whereas the number of inequalities between two candidate edges within different subgroups is (m × n). However, these (m × n) inequalities can be replaced by one ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005.

Synthesis of Skewed Logic Circuits



225

single inequality which captures the same constraints. The single inequality has the form: s1 + · · · + sm + s1′ + · · · + sn′ − t1 − · · · − tl ≥ m + n − l − 1.

(A-3)

PROOF We start from the simple case with n equal to 1. The set of inequalities that capture the constraints, denoted by Z , consists of the single inequality constraint within S1 and the m inequality constraints between S1 and S2 :

s1 +

s1′

s1 + · · · + sm ≥ m − 1, ′ sm + s1 − t1 − · · · − tl ≥ −(l − 1), sm−1 + s1′ − t1 − · · · − tl − sm ≥ −l ,

(A-4) (A-5) (A-6)

.. .

(A-7)

− t1 − · · · − tl − sm − · · · − s2 ≥ −(l + m − 2).

(A-8)

In the left-hand side of Equation (A-3), si and s′j have the same polarity, whereas tk has the opposite polarity. Multiplying each of the preceding (m + 1) inequalities ((A-4)-(A-8)) with a proper positive coefficient and summing all of them, we can derive an inequality of the same form as Equation (A-3) as follows: am+1 × (A-4) + am × (A-5) + am−1 × (A-6) + · · · + a1 × (A-8) ≥ am+1 × (m − 1) − am × (l − 1) − am−1 × l − a1 × (l + m − 2), (A-9) where a1 + am+1 = a2 − a1 + am+1 .. . = am − am−1 − · · · − a1 + am+1 = a1 + · · · + am = −(−a1 − · · · − am ).

(A-10)

Note that Equation (A-10) has essentially m equations with (m + 1) variables (a1 , . . . , am+1 ). One solution to the set of equations is: a1 = 1, a2 = 2, a3 = 4, . . . , am = 2m−1 , am+1 = 2m − 2, with the right-hand side of Equation (A-9) being: (2m − 2)(m − 1) −

m−1 

2i (l + m − 2 − i) = (2m − 1)(m − l − 1) + 1.

i=0

Dividing both sides of Equation (A-9) by a1 + am+1 = 2m − 1, we obtain: s1 + · · · + sm + s1′ − t1 − · · · − tl ≥ m − l − 1 +

1 . 2m − 1

ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005.

226

A. Cao et al.



As the left-hand side of the preceding inequality are all binary variables, we can rewrite it as: s1 + · · · + sm + s1′ − t1 − · · · − tl ≥ m − l .

(A-11)

As Equation (A-11) is derived from the inequality set Z , a solution to the set of inequalities Z is therefore a solution to Equation (A-11). Moreover, Equations (A-5) to (A-8) of inequality set Z can be derived from Equation (A-11) also, as shown in the following. As they are all binary variables, the following holds: − 2(sm + · · · + si+1 ) − (si−1 + · · · + s1 ) ≥ −(2m − i − 1).

(A-12)

Adding Equations (A-11) and (A-12), we have: si + s1′ − t1 − · · · − tl − sm − · · · − si+1 ≥ −(l + m − i − 1),

(A-13)

which is the generalization of Equations (A-5) to (A-8). It means a solution to Equation (A-11) is also a solution to Equations (A-5) to (A-8). Therefore, Equations (A-5) to (A-8) can be replaced by Equation (A-11) with the same solutions. Note that Equation (A-4) cannot be derived from (A-11), so it remains without being replaced. Thus, the set of inequalities Z with (m + 1) inequalities can be replaced by a new set of inequalities Z ′ with two inequalities, (A-4) and (A-11). Now, we consider the general case where n > 1. The set of inequalities that capture all the constraints, denoted by X , consists of Equations (A-1), (A-2), and (m×n) inequalities between S1 and S2 . These (m×n) inequalities can be grouped into n subsets, Z 1 , Z 2 , . . . , Z n , each of which consists of m inequalities. Applying the preceding simplification technique applied to Z to subset Z i , i = 1, . . . , n, the set of inequalities X with (m × n + 2) inequalities can be replaced by a new set of inequalities X ′ with (n + 2) inequalities:

s1′

s1 + s2 + · · · + sm ≥ m − 1, s1′ + s2′ + · · · + sn′ ≥ n − 1,

(A-14) (A-15)

sn′ + s1 + · · · + sm − t1 − · · · − tl ≥ m − l , ′ sn−1 + s1 + · · · + sm − t1 − · · · − tl − sn′ ≥ m − (l + 1), .. .

(A-16) (A-17)

+ s1 + · · · + sm − t1 − · · · − tl −

sn′

− ··· −

s2′

≥ m − (l + n − 1).

(A-18) (A-19)

Treating the m elements in subgroup S1 as one single element in the preceding inequalities ((A-14)-(A-19)), we can further replace the n inequalities Equations (A-16) to (A-19) with one single inequality: s1 + · · · + sm + s1′ + · · · + sn′ − t1 − · · · − tl n−1 i (2n − 2)(n − 1) + i=0 2 (m − n − l + 1 + i) ≥ n 2 −1 (m + n − l − 2)(2n − 1) + 1 1 = = m+n−l −2+ n . 2n − 1 2 −1 ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005.

Synthesis of Skewed Logic Circuits



227

It is the same as: s1 + · · · + sm + s1′ + · · · + sn′ − t1 − · · · − tl ≥ m + n − l − 1,

(A-20)

since the variables are binary valued. Therefore, the set of inequalities X ′ is further simplified to a set X ′′ with only 3 inequalities, Equations (A-1), (A-2) and (A-20). For a group S of j pair-wise conflicting candidate edges that can be divided into k subgroups, S1 , . . . , Sk , of successive candidate    edges, the number of inequality constraints can be reduced from 2j to k2 . REFERENCES CAO, A., SIRISANTANA, N., KOH, C.-K., AND ROY, K. 2002. Synthesis of selectively clocked skewed logic circuits. In Proceedings of the IEEE International Symposium on Quality Electronic Design. San Jose, CA. 229–234. CAO, A., SIRISANTANA, N., KOH, C.-K., AND ROY, K. 2003. Integer linear programming-based synthesis of skewed logic circuits. In Proceedings of the IEEE Asia South Pacific Design Automation Conference. Kitakyushu, Japan. 820–823. DEY, S., BRGLEZ, F., AND KEDEM, G. 1990. Corolla based circuit partitioning and resynthesis. In Proceedings of the ACM/IEEE Conference on Design Automation. Orlando, FL. 607– 612. GONCALVES, N. F. AND MAN, H. J. D. 1983. NORA: a race free dynamic CMOS technique for pipelined logic structures. IEEE J. Solid State Circuits 18, 3 (June), 261–266. KIM, C., LEE, J., BAEK, K. H., MARTINA, E., AND KANG, S. M. 2000. High-performance low-power skewed static logic in very deep-submicron (vdsm) technology. In Proceedings of the IEEE International Conference on Computer Design. Austin, TX. 59–64. KRAMBECK, R. H., LEE, C. M., AND LAW, H.-F. S. 1982. High-speed compact circuits with CMOS. IEEE J. Solid State Circuits 17, 3 (June), 614–619. PRASAD, M. R., KIRKPATRICK, D., BRAYTON, R. K., AND SANGIOVANNI-VINCENTELLI, A. 1997. Domino logic synthesis and technology mapping. In Proceedings of the ACM/IEEE International Workshop on Logic Synthesis. Tahoe City, CA. PURI, R., BJORKSTEN, A., AND ROSSER, T. E. 1996. Logic optimization by output phase assignment in dynamic logic synthesis. In Proceedings of the IEEE/ACM International Conference on Computer Aided Design. San Jose, CA. 2–8. REDDY, S. M. 1973. Complete test sets for logic functions. IEEE Trans. Comput. C-22, 11 (Nov.), 1016–1020. SENTOVICH, E. M., SINGH, K. J., LAVAGNO, L., AND ET AL. 1992. SIS: A system for sequential circuit synthesis. Tech. Rep. UCB/ERL M92/41 (May) University of California at Berkeley. SIRISANTANA, N., CAO, A., DAVIDSON, S., KOH, C.-K., AND ROY, K. 2001. Selectively clocked skewed logic (SCSL): A robust low-power logic style for high-performance applications. In Proceedings of the IEEE International Symposium on Low Power Electronics and Design. Huntington Beach, CA. 267–270. SOLOMATNIKOV, A., SOMASEKHAR, D., ROY, K., AND KOH, C.-K. 2000. Skewed CMOS: Noise-immune high-performance low-power static circuit family. In Proceedings of the IEEE International Conference on Computer Design. Austin, TX. 241–246. SOMASEKHAR, D. 1999. Power and dynamic noise considerations in high performance CMOS VLSI design. Ph.D. thesis, Purdue University. STOCKMEYER, L. 1983. Optimal orientations of cells in slicing floorplan designs. Inform. Control 57 (2/3), 91–101. THORP, T., YEE, G., AND SECHEN, C. 1999a. Design and synthesis of monotonic circuits. In Proceedings of the IEEE International Conference on Computer Design. Austin, TX. 569–572. THORP, T., YEE, G., AND SECHEN, C. 1999b. Monotonic static CMOS and dual Vt technology. In Proceedings of the IEEE International Symposium on Low Power Electronics and Design. San Diego, CA. 151–155. ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 
2, April 2005.

228



A. Cao et al.

ZHAO, M. AND SAPATNEKAR, S. S. 1998. Technology mapping for domino logic. In Proceedings of the IEEE/ACM International Conference on Computer Aided Design. San Jose, CA. 248–251. ZHAO, M. AND SAPATNEKAR, S. S. 2000. Dual-monotonic domino gate mapping and optimal output phase assignment of domino logic. In Proceedings of the IEEE International Symposium on Circuits and Systems. Geneva, Switzerland. 309–312. Received February 2003; revised September 2003; accepted March 2004

ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005.

Optimizing Instruction TLB Energy Using Software and Hardware Techniques I. KADAYIF Canakkale Onsekiz Mart University and A. SIVASUBRAMANIAM, M. KANDEMIR, G. KANDIRAJU, G. CHEN, and G. CHEN The Pennsylvania State University

Power consumption and power density for the Translation Look-aside Buffer (TLB) are important considerations not only in its design, but can have a consequence on cache design as well. After pointing out the importance of instruction TLB (iTLB) power optimization, this article embarks on a new philosophy for reducing the number of accesses to this structure. The overall idea is to keep a translation currently being used in a register and avoid going to the iTLB as far as possible—until there is a page change. We propose four different approaches for achieving this, and experimentally demonstrate that one of these schemes that uses a combination of compiler and hardware enhancements can reduce iTLB dynamic power by over 85% in most cases. The proposed approaches can work with different instruction-cache (iL1) lookup mechanisms and achieve significant iTLB power savings without compromising on performance. Their importance grows with higher iL1 miss rates and larger page sizes. They can work very well with large iTLB structures that can possibly consume more power and take longer to lookup, without the iTLB getting into the common case. Further, we also experimentally demonstrate that they can provide performance savings for virtually indexed, virtually tagged iL1 caches, and can even make physically indexed, physically tagged iL1 caches a possible choice for implementation. Categories and Subject Descriptors: B.3.2 [Memory Structures]: Design Styles—TLB; D.3.4 [Programming Languages]: Processors—Compilers, optimization General Terms: Design, Performance, Reliability Additional Key Words and Phrases: Power consumption, translation look-aside buffer, compiler optimization, cache design, instruction locality

This research has been supported in part by the National Science Foundation (NSF) grants CCR9988164, CCR-9900701, CCR-0097998, DMI-0075572, MIP-9701475, and equipment grant EIA9818327. Authors’ addresses: I. Kadayif, Computer Engineering Department, Canakkale Onsekiz Mart University, Canakkale, Turkey; email: [email protected]; A. Sivasubramaniam, M. Kandemir, G. Kandiraju, G. Chen, and G. Chen, Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, 16802; email: {anand,kandemir,kandiraj, gchen,guilchen}@cse.psu.edu. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or [email protected].  C 2005 ACM 1084-4309/05/0400-0229 $5.00 ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005, Pages 229–257.

230



I. Kadayif et al.

1. INTRODUCTION Power optimization has become as important a criterion as performance across a spectrum of computing devices. While the need for conserving battery energy on embedded devices is well understood [Catthoor et al. 1998], power dissipation has a crucial consequence on chip design as well—fabrication, packaging, and cooling. Reducing the power dissipation requires an in-depth examination of each system component, and research over the past few years has been very active in this area [Brooks et al. 2000; Vijaykrishnan et al. 2000]. Power is consumed whenever computing elements are accessed/activated (called dynamic power), and even when the elements are idle (called leakage power). While the latter becomes particularly important with billions of transistors clocked at high frequencies expected to be packed on a single chip, today’s main concern is still the dynamic power, which is proportional to the number of times that the device is exercised. This is particularly the case for small components that are frequently exercized such as the TLB, which is the focus of this study. Several research projects have looked at reducing dynamic power consumption by reducing the device activity. Many of these previous studies have proposed ways of either reducing the number of accesses to a device or to reduce the cost of an access itself (e.g., Inoue et al. [1999], Ghose and Kamble [1999], and Yang et al. [2001] for caches, [Catthoor et al. 1998] for DRAMs, [Zyuban and Kogge 1998; Folegnani and Gonzalez 2001; Parikh et al. 2002] for datapath components). However, there is one specific component, namely, the Translation Look-aside Buffer (TLB), which has not drawn very much attention from the architectural/software angle for power optimization. In fact, this component is much more frequently accessed than DRAMs and many other components. An instruction fetch and data reference go through address translation via the TLB which is a cache of recent virtual-to-physical address translations. Even though this unit is typically kept small (to keep access times low), its associativity is usually high to keep miss rates low. These high-associative storage structures are an important candidate for dynamic power optimization [Choi et al. 2002; Juan et al. 1997]. This is particularly very important for the instruction TLB (iTLB), which is accessed on every instruction memory reference. For instance, Figure 1 shows the dynamic energy consumption breakdown in the storage hierarchy for the different storage components (TLBs, L1 and L2 caches, and off-chip DRAM) of a fourissue superscalar machine in the execution of six Spec2000 applications for a virtually indexed, physically tagged L1 addressing strategy. These numbers show that on the average, 25.3% of the energy in the memory hierarchy is consumed by the instruction TLB (iTLB) alone, which motivates the research presented in this article. The detailed discussion of our experimental setting and benchmarks will be given later in the article. Also, it has been reported that the address translation logic consumes as much as 17% of on-chip power on the Intel StrongARM [Juan et al. 1997] and 15% on the Hitachi SH-3 [Kim 2001]. This is more or less evenly split between the instruction and data parts. Another motivation for our research comes from the power density viewpoint. Keeping power densities (power consumption per unit area) under control is a ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005.

Fig. 1. Energy breakdown (for a 0.1 micron technology) across instruction TLB (iTLB), data TLB (dTLB), instruction L1 cache (iL1), data L1 cache (dL1), unified L2 cache (L2), and off-chip memory. The iTLB and dTLB have 32 entries each. iL1 and dL1 are 8KB, 1-way and 8KB, 2-way, respectively. The L2 cache is 1MB, 2-way. The line sizes for the L1 caches and L2 are 32 bytes and 128 bytes, respectively. The main memory is assumed to be a 64MB Direct Rambus RIMM module that uses low-power modes to save energy.

Table I. Power Densities (in nW/mm²) for Six Benchmarks (with 0.1 Micron Technology) across On-Chip Memory Components (iTLB, dTLB, iL1, and dL1)

Benchmark      iTLB     dTLB    iL1     dL1
177.mesa       8.835    3.070   1.101   0.844
186.crafty     5.613    1.732   0.700   0.504
191.fma3d     11.083    1.652   1.382   0.482
252.eon        7.835    2.969   0.977   0.837
254.gap        7.980    2.522   0.995   0.728
255.vortex     5.575    2.255   0.695   0.639

The iTLB and dTLB have 32 entries each. iL1 and dL1 are 8KB, 1-way and 8KB, 2-way, respectively. The line size for the L1 caches is 32 bytes.

Keeping power densities (power consumption per unit area) under control is a very important goal for on-chip thermal management [Brooks and Martonosi 2001]. Table I gives the power densities of the on-chip memory components of a four-issue superscalar machine in the execution of six Spec2000 applications for a virtually indexed, physically tagged L1 addressing strategy (detailed descriptions of our experimental setup and benchmarks will be given later). We see from this table that the iTLB has the highest power density for the six applications, due to its small size and large dynamic energy consumption. Even when one looks at the raw power consumption per access and divides it by the area of the iTLB, its power density comes out to be 7.820 nW/mm² (obtained using CACTI [Reinman and Jouppi 2000]), compared to 0.975 and 0.670 nW/mm² for iL1 and dL1, respectively.

Currently, there exist several strategies for reducing iTLB power consumption. The first is to optimize power at the circuit level, as was done in Juan et al. [1997]; their proposal includes modifications to the basic cells and to the structure of TLBs. The second approach is to reduce the power consumption per access at the architectural level by restructuring the TLBs, for example, using a smaller structure, reducing associativity, or working with multilevel TLBs (the smaller level has lower power and can help as long as it enjoys high hit rates) [Manne et al. 1997]. Choi et al. [2002] propose a two-way banked filter TLB and a two-way banked main TLB. Other techniques along the same line include Lee et al. [2003] and Lee and Ballapuram [2003]. One could even reorganize TLBs dynamically [Balasubramonian et al. 2000]. While these approaches can reduce the power consumption per access, they do not reduce the number of accesses themselves. Instead, in the third approach, which is the one explored in this article, we try to reduce the number of TLB accesses (instead of reducing the per-access TLB energy cost) using a mix of software and hardware strategies. It is to be noted that our strategies can be used in conjunction with the other two to produce even higher savings. Similar ideas have been explored in the context of caches [Panwar and Rennels 1995]. We identify at least three different ways of reducing the number of times a TLB is accessed:

—Delaying TLB Lookup: If we can make the caches (at least the L1 cache) virtually indexed and virtually tagged (denoted VI-VT in this article), then we need to access the TLB only on an L1 miss (assuming that L2 is physically addressed). While this may cause an extra cycle of latency on L1 misses, it can considerably reduce power requirements. In fact, this is the approach used on the Intel StrongARM processor [SA 2002]. One could even try to extend the VI-VT lookup to the L2 cache as well. However, in this case a hardware implementation can become cumbersome if the L2 is off-chip, and Jacob and Mudge [2001] suggest software-based TLB maintenance.

—Implementing the TLB in Software: As mentioned in the previous solution, delaying the TLB lookup beyond L2 accesses can lessen the importance of TLB latency and dynamic power, potentially allowing an implementation in software. This also helps save on-chip real estate, in addition to power, as mentioned in Jacob and Mudge [2001]. If cache misses become frequent (for some commercial workloads, even L2 misses can be quite important, as reported in Ailamaki et al. [1999]), then the performance penalties can offset the benefits of this approach.

—Generating Physical Addresses Directly: If the software/hardware can directly provide the physical address of the page being referenced, then we do not need the TLB for that instruction/reference. While this may appear to be a radical shift from the current view (why have a virtual address at all if this is the case?), we believe (and demonstrate in this article) that there are several circumstances when one can correctly generate physical addresses directly, at least for the instruction stream that is the target of our optimizations. This approach can even be used in conjunction with the other two solutions without any loss of generality, and constitutes the underlying philosophy of the mechanisms proposed in this article.

There has been some prior work, done a long time ago, on generating addresses without going to the dTLB for data references [Knight and Rosenfeld 1984; Maddock et al. 1981; Chiueh and Katz 1992], but our focus here is on the instruction stream. In the context of instruction streams, we are aware of a similar philosophy only in the VAX architecture, which uses a register to keep the translation of the current instruction page in order to alleviate TLB lookup latencies [Strecker 1978]. This is similar to one of the strategies evaluated in this work (called HoA, as will be detailed within the article), with the focus now on power consumption. Our results will show that while this may do well for performance, it does not give the best power savings. To our knowledge, this is the first article to explore the ability of a program to directly generate physical addresses for instructions towards iTLB power savings in a modern architecture. Such an ability can be used in a system that has a virtually indexed, physically tagged (VI-PT) iL1 cache to lower iTLB power considerably. It can even lower iTLB power in a system with a virtually indexed, virtually tagged (VI-VT) iL1 cache by reducing lookups upon cache misses. Further, it can save the cycles expended in iTLB lookups upon an iL1 miss for a VI-VT iL1 cache where the iTLB is in the critical path. Finally, if we are able to successfully provide translations in most cases, then we may even want to reconsider incorporating physically indexed, physically tagged (PI-PT) caches, which are largely ignored today because translation gets in the critical path.

It should be noted that the mechanisms investigated in this article generate physical addresses directly only when we are absolutely sure (i.e., they are not speculative). Specifically, they target optimizing references to the page that has just recently been referenced. Since there is considerable spatial locality in instruction streams, we believe that one can get substantial savings even with such a simple strategy. One could ask how this philosophy is different from having a very small iTLB (degenerating even to a single-entry iTLB). The differences are that a really small iTLB can lead to performance problems (a higher miss rate), and that it still involves a comparison in matching tags (consuming power). In contrast, our approach can still work with a reasonably sized iTLB, can generate the addresses directly in several situations without requiring comparisons, and does so without incurring any performance penalties. We also show in this article how this approach can be better than a multilevel iTLB from both the power and performance angles.

The remainder of this article is organized as follows: Background on cache and iTLB lookup mechanisms is presented in Section 2. The details of our approach are given in Section 3. The experimental results are presented in Section 4. Finally, a summary of our findings along with future work is given in Section 5.

2. BACKGROUND: CACHE AND TLB LOOKUP

The iTLB needs to be consulted with a virtual page address to generate the physical page number before eventually going to the DRAM.

However, there are caches (L1 and L2) in front of the DRAM, and how these caches are looked up can have an impact on iTLB performance and power. It should be noted that a cache lookup requires an indexing part to determine the set under consideration, and a subsequent tag comparison for the blocks within the set. Either of these can be done with a virtual address or a physical address, leading to four possible combinations: virtually indexed, virtually tagged (VI-VT); virtually indexed, physically tagged (VI-PT); physically indexed, physically tagged (PI-PT); and physically indexed, virtually tagged (PI-VT). The last option (PI-VT) is not really in much use (the MIPS R6000 uses it) and is not under consideration here (since it combines the drawbacks of both VI-PT and VI-VT). In this article, we focus on the other three options for the L1 instruction cache (iL1) and assume that L2 is always PI-PT. A brief summary of how these mechanisms work for the different iL1 addressing schemes [Cekleov and Dubois 1997] is given below:

—PI-PT iL1: The physical address needs to be obtained before the cache can even be indexed, making the iTLB fall in the critical path.¹ This is also a reason why this configuration is not very popular today. In terms of power as well, the iTLB is consulted on every instruction fetch, regardless of whether the reference hits in iL1 or not. The advantage of this scheme is that there are no aliasing problems across different virtual address spaces. This scheme is depicted in Figure 2(a).

—VI-PT iL1: One way to remove the iTLB from the critical path is to index the sets of iL1 using the virtual address while the iTLB is concurrently looked up to obtain the physical address (which is expected to take less time than the iL1 indexing). The tag from the physical address is then used for the comparison with the corresponding tag bits of the set. As a result, the iTLB is not in the critical path anymore, but it is still accessed on every instruction fetch, incurring energy costs. Another downside of this approach is that write-backs require a translation; this can be handled by storing the physical indexes/addresses with each block. Many current microprocessors use this configuration (e.g., AMD K6, MIPS R10K, PowerPC), and the lookup mechanism is shown in Figure 2(b).

—VI-VT iL1: With this configuration (see Figure 2(c)), iL1 is both indexed and tagged with virtual addresses, implying that the iTLB is not required at all until an iL1 miss. One could either look up the iTLB at that time (which may add an extra cycle of latency to the iL1 miss path if L2 is PI-PT, but is very good in terms of power), or in parallel with the iL1 access (in which case the iTLB lookups are no different than in a VI-PT iL1). In this study, we use the former strategy in our evaluations since it is more power efficient, and it may not suffer significant performance penalties if the iL1 locality is sufficiently good.

¹There are several situations where, by choosing an appropriate iL1 configuration—such that the cache indexing can work with just the offset within a page and does not need the frame number—one could implement a PI-PT iL1 without making the iTLB fall in the critical path. Some commercial processors (e.g., Sun UltraSPARC II) exploit such scenarios. However, this restricts iL1 configurations and becomes very similar to a VI-PT iTLB lookup, which is evaluated in this article. Consequently, in our PI-PT model, we do not put any restrictions on iL1, and the iTLB needs to be looked up before iL1 indexing.

Fig. 2. Three different iL1 look-up strategies: (a) PI-PT, (b) VI-PT, and (c) VI-VT. iTLB lookup is before iL1 indexing in PI-PT, done concurrently with iL1 indexing in VI-PT, and done only on an iL1 miss in VI-VT.

The StrongARM is an example of this kind of iL1 indexing [SA 2002]. This strategy has aliasing problems, and the typical solution is to add a few most significant bits to differentiate between address spaces.

3. OUR APPROACHES

3.1 Overall Philosophy

As was mentioned earlier, our overall philosophy is to do the translation for a page once, and subsequently keep reusing it directly without going to the iTLB, as long as it does not change. Two ways of achieving this are:

— Storing the translations of several previously visited pages, either in hardware (in which case it is no different from the iTLB itself, except maybe a smaller version of it) or in software (in which case we incur high performance overheads).

— Storing just a single translation—namely, that of the current page—and keep using it as long as we do not leave that page. When we do leave the page, we look up the iTLB for the target page. This is the strategy adhered to in all the mechanisms proposed in this work.

3.2 Hardware Support

Whenever our mechanism cannot supply a translation, there needs to be a way of triggering an iTLB lookup based on the virtual address. The result of this lookup moves the corresponding iTLB entry (both the physical frame number and the protection bits) into a register called the Current Frame Register (CFR), whose format is of the form

< Virtual Page Number, Physical Frame Number, Protection/Other Bits >
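As a concrete illustration, the CFR could be modeled as follows. This is a minimal C sketch, assuming the 4KB pages used in our experiments; the type and field names are illustrative, not part of any real ISA:

    /* Hypothetical model of the Current Frame Register (CFR). With 4KB
     * pages, the low 12 bits of an address form the page offset, and the
     * remaining bits form the virtual page number (VPN) on the virtual
     * side or the physical frame number (PFN) on the physical side. */
    #include <stdint.h>

    #define PAGE_SHIFT 12u                 /* log2 of the 4KB page size */

    typedef struct {
        uint64_t vpn;   /* virtual page number of the current page      */
        uint64_t pfn;   /* physical frame number it maps to             */
        uint32_t prot;  /* protection/other bits copied from the iTLB   */
        int      valid; /* cleared by the OS on eviction/remap          */
    } cfr_t;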

Fig. 3. iL1 lookup in the presence of the CFR, assuming the translation is present there: (a) PI-PT, (b) VI-PT, (c) VI-VT. See the explanations in Section 3.2 for how this mechanism works.

The trigger mechanism itself (implemented in hardware or software) will be discussed in detail for each of our approaches in the subsequent discussion. Once we have the current physical frame number in the CFR, we can do the next instruction fetch as follows, depending on the cache addressing mechanism (described earlier in Section 2; an illustrative sketch of this address formation is given at the end of this subsection):

—PI-PT iL1: The page offset is obtained from the low-order bits of the PC, and the physical frame number is obtained from the CFR. This constitutes the physical address, and iL1 is looked up with this address. This is pictorially shown in Figure 3(a).

—VI-PT iL1: The virtual address is generated by appending the virtual page number in the CFR with the page offset bits of the PC. The index part of this address is used to index iL1. The physical address is generated by appending the physical frame number part of the CFR with the page offset bits of the PC, and the tag part of this result is used to compare against the tags in the set that was indexed in iL1. This is shown in Figure 3(b).

—VI-VT iL1: We use the PC virtual address entirely to look up iL1. If we obtain the data from there, then we are done. Only on a miss do we access the CFR to get the physical frame number, concatenated with the page offset bits of the PC, to look up L2. This lookup mechanism is shown in Figure 3(c).

Our proposals can work in conjunction with each of these cache addressing mechanisms to provide power savings. In fact, they can even provide performance savings in the case of PI-PT and VI-VT caches, since the iTLB access can get in the critical path (always for PI-PT, and on an iL1 miss for VI-VT). One could even hypothesize that we may want to rethink incorporating PI-PT, which is largely ignored today, as long as our approaches can provide translations in most cases.
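The address formation just described reduces to a few shifts and masks. The sketch below is illustrative C only, continuing the hypothetical cfr_t and PAGE_SHIFT definitions from Section 3.2; in the VI-VT case the PC is used directly, and the physical address is formed the same way as for PI-PT, but only after an iL1 miss:

    #define PAGE_OFFSET_MASK ((1ull << PAGE_SHIFT) - 1)

    /* PI-PT: the full physical address is needed before iL1 indexing. */
    uint64_t pipt_fetch_addr(const cfr_t *cfr, uint64_t pc) {
        return (cfr->pfn << PAGE_SHIFT) | (pc & PAGE_OFFSET_MASK);
    }

    /* VI-PT: iL1 is indexed with the virtual address, while the tag
     * comparison uses the physical address built from the CFR's PFN. */
    uint64_t vipt_index_addr(const cfr_t *cfr, uint64_t pc) {
        return (cfr->vpn << PAGE_SHIFT) | (pc & PAGE_OFFSET_MASK);
    }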

3.3 OS Support and Related Issues

The OS needs to ensure that the current page (the one whose translation is being used currently) is not evicted (i.e., that its physical address does not change). This is not expected to be a problem, since this page will be a very unlikely candidate for LRU eviction anyway (and we do this for at most one page per application process). If so desired, one could ask the OS to invalidate the CFR if this page really has to be evicted/remapped (just as the entry would be invalidated in the iTLB). Note that the CFR is not explicitly available to the application program (either for reading or writing); it is used directly by the hardware. However, in supervisory mode, the OS is allowed to read/write the CFR (so that this page is not evicted) and possibly reset/invalidate it. Consequently, the program cannot change the permissions of a page (which are also in the CFR) without going via the OS. Upon a context switch, the CFR can be treated as yet another register whose context is saved and restored. Also, when using our approach in a multithreaded architecture, one can provide a CFR per thread.

It is also important to address the issue of the variable page sizes supported by some systems (e.g., supporting 4K, 8K, and 16K pages at the same time). In this case, we propose to tune our mechanisms to work with the smallest page size supported by the system. This should work in practice, given that the larger page sizes supported by an architecture are usually integer multiples of the smallest page size.

3.4 Strategies

3.4.1 Hardware-only Approach (HoA). This is an approach which does not require any software support. The hardware directly examines the virtual addresses generated by the PC and compares them with the virtual page number part of the CFR. If they match, then the target instruction is in the same page (requiring no translation) and the iL1 lookup is performed as described above. If they do not match, then we force an iTLB lookup. In the case of a VI-PT iL1, this lookup is done in parallel with the iL1 indexing (incurring an energy cost in the iTLB). In a VI-VT iL1, on the other hand, even if the page numbers do not match, the iTLB is not looked up until an iL1 miss. The hardware that is needed is a comparator that compares the virtual page number produced by the PC with that in the CFR. As mentioned earlier, the VAX uses a similar strategy—holding the current instruction page translation in a register called the IPA—to alleviate TLB lookup latencies [Strecker 1978]. In this evaluation, we are looking at a more modern processor with out-of-order execution and complex control flow structures, and our focus here is on power consumption. The advantage of this approach is that we perform iTLB lookups exactly when needed (it is very accurate). The downside is the overhead of the comparison (an energy cost) on every instruction fetch. We believe that the performance overhead can be hidden from the critical path by performing this operation as soon as the PC is updated (and before the subsequent instruction fetch cycle).
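In hardware terms, the HoA check amounts to a single page-number equality comparison per fetch. A minimal sketch, reusing the hypothetical cfr_t from Section 3.2:

    /* HoA: on every instruction fetch, compare the VPN of the PC with
     * the CFR; an iTLB lookup is forced only on a mismatch (or when
     * the CFR holds no valid translation). */
    int hoa_needs_itlb_lookup(const cfr_t *cfr, uint64_t pc) {
        return !cfr->valid || (pc >> PAGE_SHIFT) != cfr->vpn;
    }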

Note that this technique is very similar to the concept of block buffering [Ghose and Kamble 1998a, 1998b; Kandemir et al. 2001] used in energy-efficient cache architectures. The idea in block buffering is to place one or more line buffers in front of the first-level cache to capture data blocks with temporal locality. HoA can be seen as an application of block buffering to TLBs.

3.4.2 Software-only Conservative Approach (SoCA). At the other end of the spectrum, we consider a scheme where all the triggering of the iTLB lookup is done explicitly by the software (i.e., the compiler). The reader should note that there are two ways by which program execution can move from one instruction page to another: (a) explicit branch instructions whose target is in a different page (we call this the BRANCH case), and (b) two successive instructions on a page boundary (we refer to this as the BOUNDARY case), that is, one is the last instruction of a page and the next is the first instruction of the following page (we assume that instructions are aligned so that a single instruction does not cross page boundaries). Further, we assume that an iTLB lookup is done by the hardware for every target of a branch, regardless of whether it crosses a page boundary or not, and that all other instructions directly use the CFR. This automatically handles the BRANCH case. To handle the BOUNDARY case, the compiler explicitly inserts a BRANCH instruction at the end of each instruction page, with the target being the very next instruction (the first one on the next page). The advantage of SoCA is that it does not even require the extra logic incurred by the previous mechanism, and there is no extra energy cost in the normal instruction fetch path. The downside is the extra instructions (both cycles and energy) incurred in the BOUNDARY cases (this overhead is negligible). The other problem is that we are being very conservative (as our results will show) in assuming that every branch target is in a different page; this is what the next two schemes try to address.

3.4.3 Software-only Less Conservative Approach (SoLA). In this approach, we take the mechanism explained in Section 3.4.2 and try to be less conservative in the BRANCH cases. Specifically, we want to eliminate iTLB lookups when a static analysis of the code by the compiler can reveal that the branch target is within the same page as the branch instruction itself (this typically occurs when branch targets are given as immediate operands or as PC-relative operands). The necessary compiler support is to check whether the target of a statically analyzable branch is on the same page as the branch itself. An implementation of this requires that the hardware be able to distinguish between two types of branches: those identified by the compiler as targeting the same page (which do not go through the iTLB), and the normal branches, whose targets go through the iTLB. Branches of the first type are called in-page branches. We use an extra bit in branch instructions to differentiate between in-page branches and the others. One can envision this bit being part of the address itself, indicating whether the branch target needs to be looked up in the iTLB or not. This approach enjoys the benefits of the previous one, with the additional benefit of avoiding iTLB lookups when branch targets are statically analyzable and found to be within the same page as the corresponding branches themselves.
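The compiler-side test for SoLA is a simple page-number comparison on statically known targets; a sketch under the 4KB-page assumption is given below. (SoCA needs no such test: it only emits a fall-through branch at each page boundary, folding the BOUNDARY case into the BRANCH case.)

    /* SoLA: a statically analyzable branch can be marked in-page (and
     * thus skip the iTLB) when its target shares the branch's page. */
    int sola_is_in_page(uint64_t branch_va, uint64_t target_va) {
        return (branch_va >> PAGE_SHIFT) == (target_va >> PAGE_SHIFT);
    }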

Fig. 4. Integration with branch prediction logic. Here, BTB denotes the branch target buffer. BA and BTA correspond to branch address and branch target address, respectively. CFR is the current frame register. VPN, PFN, and PB correspond to three portions of the CFR, namely, virtual page number, physical frame number, and protection bits, respectively.

However, we are still being quite conservative, in that we force lookups even when the target is within the same page but this cannot be determined at compile time.

3.4.4 Integrated Hardware-Software Approach (IA). While the hardware-only mechanism is quite accurate in finding out when to go to the iTLB, its downside is the energy cost on every instruction execution. The software-only approach avoids this, but can turn out to be conservative and goes to the iTLB more often than needed. In this section, we propose an integrated approach that gets the best of these two extremes. We can use the compiler-based approach to track the BOUNDARY cases, since the software schemes are already accurate in predicting page transitions at these points. However, we can adopt a hardware mechanism (not the one used in Section 3.4.1) for the BRANCH cases, so that we can use runtime information to determine whether the target really goes out of a page (and whether the branch is taken at all). We implement this within the existing framework of branch predictors. For instance, an implementation of this check with the Branch Target Buffer [Sima et al. 1997] is shown in Figure 4, and it is the one evaluated in our studies. The BTB (used in several commercial offerings such as the Pentium, PA 8000, and PowerPC 620 [Sima et al. 1997]), indexed by the address of the branch instruction, keeps the address of the target instruction to be executed next, together with additional state information. As soon as the PC of the branch instruction is generated, this table is looked up concurrently with the IF stage of the branch instruction itself. Consequently, the IF of the (likely) branch target is performed in the next cycle if we hit in the BTB. Our enhancement to this mechanism is to simply check whether the virtual address (page number bits) coming out of the BTB matches the CFR virtual page number (see Figure 4). If it does, then the iTLB is not used for the target instruction fetch. Otherwise, the iTLB may need to be consulted (not always in a VI-VT cache).

Fig. 5. Pseudocode of iTLB lookups during branch executions in IA.

While the evaluations in this article have been performed with what has been explained here, it is possible to make this work with other types of branch prediction mechanisms as well. The general idea is to wait until a branch target address is available and then compare its virtual page number with that in the CFR. For example, if a target-address-based predictor is not used and branches are handled with a predecoding mechanism, then the CFR comparison can be employed at that time. The situations in which the iTLB is looked up are expressed in pseudocode format in Figure 5. Essentially, we avoid iTLB lookups when the branch target is predicted correctly and the target is within the same page, and default back to an iTLB lookup otherwise (restated compactly in the sketch below). As can be seen in this figure, there are four points of return (A, B, C, and D) from this routine. In (A), there is no iTLB lookup at all. In (B), we incur an iTLB lookup regardless of whether the taken target falls in the same page or not (this is a little conservative, but with the high accuracies of branch predictors this may not be a major problem). In (C), we incur an iTLB lookup, but this would definitely be needed since there is a page change (i.e., in this case, there is no escape from updating the contents of the CFR). Finally, in (D), we incur an iTLB lookup that is not actually needed, in cases where the predictor failed but the target was still on the same page. As a result, we are being a little conservative in the (B) and (D) cases, but these penalties are bounded by the inaccuracy of the predictor. One could try optimizing this further in future work.

No additional performance penalty is incurred by this mechanism. None of these mechanisms affect iL1 and L2 hits or misses, and thus they do not affect the energy consumption of the rest of the memory system. Further, in a VI-PT cache, these mechanisms do not affect the execution cycles either. In a VI-VT cache, our mechanisms are expected to help (rather than hurt) performance by possibly reducing address translation overheads on an iL1 miss.
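The top-level rule can be restated compactly as follows. This is a C-like reading of the decision rather than a reproduction of the pseudocode in Figure 5; cases (B)-(D) all fall into the lookup branch:

    /* IA at branch time: avoid the iTLB only when the predicted target
     * is trusted and falls within the current page (return point (A));
     * every other outcome, covering (B)-(D), defaults to a lookup. */
    int ia_needs_itlb_lookup(uint64_t predicted_tgt, int prediction_correct,
                             const cfr_t *cfr) {
        uint64_t pred_vpn = predicted_tgt >> PAGE_SHIFT;
        if (prediction_correct && cfr->valid && pred_vpn == cfr->vpn)
            return 0;                  /* (A): no iTLB lookup          */
        return 1;                      /* (B)/(C)/(D): look up iTLB    */
    }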

Table II. Default Configuration Parameters Used in Our Experiments

Simulation Parameter          Value

Processor Core
RUU Size                      64 instructions
LSQ Size                      32 instructions
Fetch Queue Size              8 instructions
Fetch Width                   4 instructions/cycle
Decode Width                  4 instructions/cycle
Issue Width                   4 instructions/cycle (out-of-order)
Commit Width                  4 instructions/cycle (in-order)
Functional Units              4 integer ALUs, 1 integer multiply/divide,
                              4 FP ALUs, 1 FP multiply/divide

Memory Hierarchy
L1 Instruction Cache (iL1)    8KB, 1-way, 32 byte blocks, 1 cycle latency
L1 Data Cache (dL1)           8KB, 2-way, 32 byte blocks, 1 cycle latency
L2                            1MB unified, 2-way, 128 byte blocks, 10 cycle latency
Instruction TLB (iTLB)        32 entries, fully associative, 50 cycle miss penalty
Data TLB (dTLB)               128 entries, fully associative, 50 cycle miss penalty
Instruction Page Size         4KB
Data Page Size                4KB
Off-Chip Memory (DRAM)        128MB (divided into 32MB banks), 100 cycle latency

Branch Logic
Predictor                     Bimodal with 4 states
BTB                           1024 entries, 2-way
Misprediction Penalty         7 cycles

Per-Access Dynamic Energy Values
iL1                           0.259nJ
dL1                           0.645nJ
L2                            8.581nJ
iTLB                          0.149nJ
dTLB                          0.376nJ

All dynamic energy values are for a 0.1 micron process technology (under a 1V supply voltage) and are obtained from the CACTI tool.

4. PERFORMANCE RESULTS

4.1 Experimental Setup

In this section, we present a detailed energy and performance evaluation of the optimization strategies proposed in this work. Unless stated otherwise, we use the processor architecture whose parameters are listed in Table II. All energy numbers are for a 0.1 micron technology and are obtained using the CACTI tool [Reinman and Jouppi 2000]. The architectural parameters given in this table define our default configuration. Later in this section, we change some of these parameters to evaluate different issues. It should be noted that many modern embedded processors are superscalar architectures.

To test the effectiveness of our strategies, we used six benchmarks from the Spec2000 benchmark suite. Spec2000 is an industry-standardized benchmark suite designed to provide a comparative measure of performance across a wide range of hardware [Henning 2000].

Table III. Benchmarks Used in Our Experiments and Their Important Characteristics

             VI-PT                     VI-VT                     iL1
Benchmark    Cycles (M)  iTLB Energy   Cycles (M)  iTLB Energy   Miss Rate
177.mesa     188.1       109.1         196.1       3.345         0.002
186.crafty   331.7       124.1         350.5       8.385         0.014
191.fma3d    169.3       112.7         176.6       3.040         0.011
252.eon      263.1       134.5         274.7       5.221         0.010
254.gap      161.3       112.2         165.6       2.005         0.006
255.vortex   293.9       108.4         310.5       6.345         0.027

             Number (Percentage)    Page Crossing
Benchmark    of Dynamic Branches    BOUNDARY            BRANCH
177.mesa     23.6 (8.9%)            99016 (1.77%)       5503671 (98.23%)
186.crafty   36.4 (12.6%)           86925 (1.09%)       7969935 (98.91%)
191.fma3d    50.8 (18.6%)           13513 (0.11%)       12168347 (99.89%)
252.eon      38.7 (12.3%)           312314 (1.99%)      15344827 (98.01%)
254.gap      19.9 (7.3%)            722028 (11.31%)     5662714 (88.69%)
255.vortex   43.2 (16.6%)           577674 (5.75%)      9473056 (94.25%)

All values are obtained using the default configuration. The percentage in the second column in the bottom part shows the percentage of branch instructions of the total instructions executed. The actual page crossings for the BOUNDARY and BRANCH cases are shown, and their relative percentage of contribution to the crossings. The numbers in columns two and four in the upper part, and in column two in the bottom part, are in millions. All energy values are in millijoules.

We selected six benchmarks from this suite and, after skipping the first 1 billion instructions, simulated the next 250 million instructions. The important characteristics of these benchmarks are given in Table III. We selected these six benchmarks because they stress the iTLB more than the others, due to their relatively worse instruction locality (their iL1 miss rates are higher). The second and third columns in the upper part of Table III give the execution cycles and iTLB energy consumptions of our default configuration when iL1 is VI-PT. The fourth and fifth columns in the upper part give the same information for the VI-VT iL1. The sixth column in the upper part presents iL1 miss rates. The second column in the bottom part gives the number of branch instructions executed and their percentage with respect to the total number of instructions executed. The last two columns in the bottom part give the number of page crossings during execution. This number is divided into two portions: the BRANCH case (i.e., page crossings as a result of a branch instruction) and the BOUNDARY case (i.e., page crossings due to sequential execution across a page boundary). We clearly see that the overwhelming majority of dynamic page crossings are due to branches.

All these experiments have been conducted using SimpleScalar [Burger et al. 1996], with the sim-outorder cycle-level model. The execution without any of our optimization mechanisms is referred to as the base execution in the rest of this article, and the iTLB energy numbers (in columns three and five in the upper part of Table III) and execution cycles (in columns two and four in the upper part of Table III) are obtained with this model for the default configuration. SoCA and SoLA require an examination of the assembly code by the compiler to determine the page boundaries and branches. We also compare all our schemes with an OPT execution model, which gives the lowest iTLB energy without any further code transformations. In this model, iTLB energy is consumed only when there is an actual page change.

Table IV. The Number of Dynamic Branches Crossing Page Boundaries with Respect to the Total Number of Instructions

Benchmark    177.mesa   186.crafty   191.fma3d   252.eon   254.gap   255.vortex
Percentage   2.2%       3.2%         4.8%        6.1%      2.3%      3.7%

Fig. 6. Normalized iTLB energy consumptions. Left: VI-PT. Right: VI-VT. These energy values are normalized with respect to that of the base case for each iL1 lookup mechanism as given in Table III (i.e., the base is assumed to be 100).

The number of dynamic branches crossing page boundaries, with respect to the total number of instructions, is given in Table IV.

4.2 Results

We first give in Figures 6 and 7 the iTLB energy consumptions and overall execution cycles of our four strategies (HoA, SoCA, SoLA, and IA), normalized with respect to the corresponding values for the base case. These schemes are also compared with the OPT results. Examining the energy consumption graphs (Figure 6), we see that all four of our schemes provide a substantial reduction in iTLB energy for both VI-PT and VI-VT. On the average (over all six applications), the iTLB energy consumption is reduced to just 5.69%, 12.24%, 5.01%, and 3.82% for VI-PT, and 15.23%, 36.83%, 16.39%, and 14.04% for VI-VT, with HoA, SoCA, SoLA, and IA, respectively. We see that IA comes very close to the OPT energy consumption (3.20% for VI-PT and 12.74% for VI-VT on the average). While the savings in both iL1 addressing strategies are quite good, they are better for VI-PT. This can be explained by the fact that in a VI-VT iL1, the address translation is done only on an iL1 miss. There is then a higher probability (though not always, as will be explained later) of the translation missing in our CFR as well (because of the worse locality when this occurs). Still, we should point out that we get over 85% iTLB energy reduction on the average for VI-VT with our IA scheme.

We next examine each of our four strategies in closer detail. With HoA, the energy consumption presented in these graphs is due to two factors: the iTLB lookup when the page comparison with the CFR indicates a page crossing, and the energy consumption of the comparison itself, which is incurred on every instruction fetch (regardless of whether there is a page crossing or not).

Fig. 7. Normalized execution cycles for VI-VT. These values are normalized with respect to the execution cycles of the VI-VT base case as given in Table III (i.e., the base is assumed to be 100). We did not observe any significant differences in execution cycles across the schemes, and compared to the base execution, for a VI-PT iL1.

Table V. Dynamic Number of iTLB Lookups for SoCA, SoLA, and IA (VI-PT)

             SoCA
Benchmark    BOUNDARY          BRANCH
177.mesa     99016 (0.41%)     23895619 (99.59%)
186.crafty   86925 (0.23%)     37174532 (99.77%)
191.fma3d    13513 (0.03%)     51083905 (99.97%)
252.eon      312314 (0.77%)    40386387 (99.23%)
254.gap      722028 (3.40%)    20531371 (96.60%)
255.vortex   577674 (1.31%)    43422782 (98.69%)

             SoLA
Benchmark    BOUNDARY          BRANCH
177.mesa     99016 (0.99%)     9893195 (99.01%)
186.crafty   86925 (0.66%)     13000618 (99.34%)
191.fma3d    13513 (0.07%)     19451932 (99.93%)
252.eon      312314 (1.52%)    20268715 (98.48%)
254.gap      722028 (6.83%)    9852715 (93.17%)
255.vortex   577674 (3.57%)    15595749 (96.43%)

             IA
Benchmark    BOUNDARY          BRANCH
177.mesa     99016 (1.48%)     6590313 (98.52%)
186.crafty   86925 (0.85%)     10133921 (99.15%)
191.fma3d    13513 (0.10%)     14043552 (99.90%)
252.eon      312314 (1.59%)    19277621 (98.41%)
254.gap      722028 (9.24%)    7092915 (90.76%)
255.vortex   577674 (5.28%)    10360962 (94.72%)

The numbers in parentheses indicate the contributions of the BOUNDARY and BRANCH cases. Note that the number of page crossings due to the BRANCH cases is higher than the corresponding values in the last column of Table III.

The latter factor accounts for the difference between HoA and OPT, and it does turn out to be reasonably significant. As noted earlier, the last two columns in the bottom part of Table III give the actual page crossings incurred during the execution of these applications, broken down into the BOUNDARY and BRANCH cases. We can see that the BRANCH cases typically overwhelm the BOUNDARY cases. Table V gives the page crossings that are forced by the three schemes—SoCA, SoLA, and IA—to look up the iTLB (sometimes conservatively). Note that the BRANCH case crossings are higher than the corresponding values in Table III, while the BOUNDARY case crossings are the same (as these strategies differ from the optimal only in how the branches are treated).

Table VI. Static and Dynamic Branch Statistics

Static Statistics
Benchmark    Total   Analyzable      Page Crossings   In-Page
177.mesa     563     472 (83.8%)     117 (24.8%)      355 (75.2%)
186.crafty   2161    1985 (91.8%)    515 (25.9%)      1470 (74.1%)
191.fma3d    532     477 (89.7%)     142 (29.8%)      335 (70.2%)
252.eon      706     548 (77.6%)     204 (37.2%)      344 (62.8%)
254.gap      883     785 (88.9%)     146 (18.6%)      639 (81.4%)
255.vortex   2781    2548 (91.6%)    1078 (42.3%)     1470 (57.7%)

Dynamic Statistics
Benchmark    Total       Analyzable          Page Crossings      In-Page
177.mesa     23645387    19175565 (81.1%)    5173141 (27.0%)     14002424 (73.0%)
186.crafty   36364110    31864428 (87.6%)    7690514 (24.1%)     24173914 (75.9%)
191.fma3d    50803392    44644775 (87.9%)    13012802 (29.1%)    31631973 (70.9%)
252.eon      38654600    28783921 (74.5%)    8666249 (30.2%)     20117672 (69.8%)
254.gap      19989582    18034984 (90.2%)    7356328 (40.8%)     10678656 (59.2%)
255.vortex   43248486    37933459 (87.7%)    10106426 (26.6%)    27827033 (73.4%)

Static statistics (given in the upper portion) are obtained from the source codes. The Analyzable column gives the number of branch instructions in the code whose target (whether in-page or not) can be detected at compile time, along with the contribution of these branches to the total number of branches. The next two columns show how many (and what percentage) of the analyzable branches cross a page boundary or not. The bottom portion of the table gives similar statistics for the dynamic execution.

SoCA turns out to be much worse than OPT and the other three schemes because of its conservative assumption that each branch crosses a page boundary. One can observe that the absolute numbers under the BRANCH column for SoCA in Table V are higher than those in the corresponding columns for the other schemes, and, as Table III suggests, the BRANCH case is also the dominating situation compared to the BOUNDARY case. SoLA, on the other hand, can optimize situations where there is no page crossing if the branch target is available at compile time. Consequently, this is reflected in the lower number of iTLB lookups required by this scheme for the BRANCH cases. Table VI shows the number of static occurrences of the branches whose target is available at compile time (termed "Analyzable" in the table), and it also shows how many times such branches occur in the dynamic execution. On the average, we find that these dynamic instances amount to 84.8% of the total, and of these, 70.4% are within the same page (not requiring a lookup). This turns out to be a significant fraction of the total branches (nearly 60%), leading to the substantial reduction in energy for SoLA compared to SoCA.

Moving on to IA, we note that it is very close to OPT in most cases. As explained earlier, the only points where IA may need extra iTLB lookups over OPT are when the branch prediction is not accurate. Table VII gives the percentage of dynamic branches that were predicted accurately by the branch prediction mechanism. As can be seen from this table, the misprediction rates are less than 15%, explaining why IA comes close to OPT. In fact, if we could use a more accurate predictor, IA would come even closer to OPT.

Having covered the energy results, we present the execution time results with these schemes for the VI-VT cache in Figure 7.

Table VII. Branch Predictor Accuracy

Benchmark            177.mesa   186.crafty   191.fma3d   252.eon   254.gap   255.vortex
Predictor Accuracy   94.14%     91.16%       95.82%      85.23%    89.55%    97.38%

Table VIII. Energy Consumptions with Different iTLB Configurations (VI-PT)

                      1-entry                               8-entry, FA
Benchmark    Base     OPT             IA                Base      OPT             IA
177.mesa     6.585    0.245 (3.72%)   0.2719 (4.13%)    99.256    2.454 (2.47%)   2.843 (2.86%)
186.crafty   7.144    0.337 (4.73%)   0.391 (5.47%)     111.991   3.242 (2.89%)   4.016 (3.59%)
191.fma3d    6.804    0.556 (8.18%)   0.599 (8.81%)     102.182   4.475 (4.37%)   5.132 (5.02%)
252.eon      7.353    0.642 (8.73%)   0.734 (9.97%)     121.118   6.145 (5.07%)   7.544 (6.23%)
254.gap      6.560    0.269 (4.10%)   0.296 (4.51%)     101.760   2.466 (2.42%)   2.977 (2.93%)
255.vortex   6.722    0.456 (6.79%)   0.478 (7.11%)     98.919    4.392 (4.44%)   4.708 (4.76%)

                      16-entry, 2-way                       32-entry, FA
Benchmark    Base      OPT             IA                Base      OPT             IA
177.mesa     146.525   3.070 (2.09%)   3.664 (2.50%)     109.075   2.199 (2.01%)   2.625 (2.41%)
186.crafty   166.305   4.266 (2.56%)   5.405 (3.25%)     124.110   3.162 (2.54%)   4.011 (3.23%)
191.fma3d    151.138   6.537 (4.32%)   7.510 (4.96%)     112.685   4.781 (4.24%)   5.517 (4.89%)
252.eon      179.744   8.814 (4.93%)   10.88 (6.05%)     134.544   6.145 (4.56%)   7.689 (5.71%)
254.gap      150.565   3.512 (2.33%)   4.264 (2.83%)     112.205   2.506 (2.23%)   3.067 (2.73%)
255.vortex   145.956   6.156 (4.21%)   6.623 (4.53%)     108.424   3.944 (3.63%)   4.293 (3.96%)

Absolute energy values are in millijoules. The numbers within the parentheses under the OPT and IA columns show their energy as a percentage of the base case.

It is to be noted that there is no significant difference in execution cycles with these schemes (compared to the base case) for a VI-PT cache, since all the iTLB lookups are done in parallel with the iL1 cache. The overhead of the extra instructions for the BOUNDARY cases is very low. The schemes allow a translation to be already available in many situations even after one misses in the VI-VT iL1. In such cases, we do not incur the extra latency of an iTLB lookup before we need to go to L2, which is physically addressed (both index and tag) in our evaluations. As can be observed from Figure 7, IA provides between 2-5% savings in execution cycles, with a saving of 3.55% on the average. These savings correspond directly to how accurately IA is able to predict whether an iTLB lookup is really needed. Even though these applications are the ones with relatively high iL1 miss rates in the Spec2000 suite, it has been reported that commercial workloads (such as databases) have much higher iL1 miss rates [Ailamaki et al. 1999]. In such situations, our approach can provide substantial cycle savings as well, in addition to energy savings for VI-VT caches. This is explored further in Section 4.8 of this article.

In the interest of clarity, in the remainder of this section we specifically focus on IA, which provides the best energy and cycles overall, and perform a more detailed sensitivity analysis with different hardware configurations while comparing it with OPT.

4.3 Sensitivity to iTLB Configuration

4.3.1 Monolithic (Single-Level) iTLB Configurations. Tables VIII and IX give the energy consumption and execution cycles for the base case as well as the OPT and IA executions with a VI-PT iL1 for four different iTLB configurations (1-entry; 8-entry, FA; 16-entry, 2-way; 32-entry, FA).

Table IX. Execution Cycles (in Millions) with Different iTLB Configurations for IA (VI-PT)

Benchmark    1-entry   8-entry, FA   16-entry, 2-way   32-entry, FA
177.mesa     437.6     244.5         198.0             188.1
186.crafty   650.7     372.8         333.9             331.7
191.fma3d    748.8     185.5         178.9             169.3
252.eon      897.4     331.6         310.5             263.1
254.gap      426.2     181.9         172.4             161.3
255.vortex   717.0     372.5         345.8             293.9

Note that the iTLB in our default configuration was 32-entry, FA (its results are reproduced here for ease of comparison). While 8 through 32 entries may appear to be reasonable sizes for an iTLB, the choice of also using a 1-entry iTLB was made to see whether the instruction locality was good enough by itself to provide good performance at a much smaller power consumption. Tables X and XI give the same information for a VI-VT iL1 lookup. Incidentally, the results for multilevel iTLB structures are given in Section 4.3.2.

The iTLB energy for a given execution is given by na × Ea + nm × Em, where na and nm are the number of iTLB accesses and iTLB misses, respectively, and Ea and Em are the energy costs per access and per miss, respectively. For any particular scheme (whether it is IA, OPT, or the base case), na remains the same when we change the iTLB configuration. While nm for a given scheme typically increases when we go to a smaller (or less associative) iTLB, the change is the same for all schemes. Hence, when the number of iTLB misses decreases, the importance of IA (or OPT) is felt even more (reflected in the smaller percentages of energy consumption given in parentheses in Tables VIII and X for the better iTLBs). We find good energy benefits with IA for all the configurations considered, for both VI-PT and VI-VT (though the absolute energy for the latter is itself much lower than that of the former in the base case). While larger (and highly associative) iTLBs are good for reducing misses and providing good performance, their drawback is high power consumption. The results presented above show that we can use a scheme such as IA in conjunction with a larger iTLB to get its performance benefits, while consuming as little power as a much smaller iTLB that does not employ any power optimizations.

More importantly, if we look at the absolute energy consumption values for VI-PT with IA, we can observe that they are in most cases comparable to (and sometimes even smaller than) the absolute energy consumption of the base VI-VT. For example, a 16-entry, 2-way iTLB used in conjunction with a VI-PT iL1 and IA has an energy consumption of 6.623 mJ for 255.vortex, while the same iTLB for a baseline VI-VT turns out to consume 9.047 mJ. These results show that the choice of cache indexing does not need to be governed by the iTLB power consumption with our mechanism (the StrongARM has possibly chosen a VI-VT L1 addressing scheme for TLB power optimization). We achieve this result without compromising the performance benefits of VI-PT (recall that VI-VT incurs an extra latency on an iL1 miss)—compare the cycles for the same two iTLB configurations in Tables IX and XI, where our approach with VI-PT does around 3% better in terms of cycles than the base VI-VT execution.
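Written out, the energy model na × Ea + nm × Em from above is a two-term linear cost; the trivial helper below makes the accounting explicit (a sketch only; the per-access and per-miss costs Ea and Em would come from a tool such as CACTI for the configuration at hand):

    /* iTLB energy model: E = na*Ea + nm*Em, with na/nm the access and
     * miss counts and Ea/Em the per-access and per-miss energy costs. */
    double itlb_energy(long long na, long long nm, double Ea, double Em) {
        return (double)na * Ea + (double)nm * Em;
    }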

Table X. Energy Consumptions with Different iTLB Configurations (VI-VT)

                      1-entry                                8-entry, FA
Benchmark    Base     OPT              IA                Base     OPT              IA
177.mesa     0.241    0.043 (17.99%)   0.046 (19.23%)    3.440    0.472 (13.73%)   0.504 (14.65%)
186.crafty   0.574    0.089 (15.50%)   0.096 (16.89%)    7.976    1.091 (13.67%)   1.172 (14.69%)
191.fma3d    0.216    0.040 (18.91%)   0.046 (21.37%)    3.316    0.469 (14.16%)   0.520 (15.68%)
252.eon      0.532    0.082 (15.54%)   0.088 (16.55%)    5.187    1.024 (19.75%)   1.088 (20.98%)
254.gap      0.148    0.030 (20.32%)   0.033 (22.39%)    1.978    0.350 (17.72%)   0.384 (19.43%)
255.vortex   0.468    0.112 (23.91%)   0.117 (25.15%)    6.280    1.463 (23.30%)   1.532 (24.39%)

                      16-entry, 2-way                        32-entry, FA
Benchmark    Base      OPT              IA                Base     OPT              IA
177.mesa     4.596     0.811 (17.66%)   0.878 (19.11%)    3.345    0.370 (11.06%)   0.401 (11.99%)
186.crafty   11.273    1.097 (9.74%)    1.217 (10.79%)    8.385    0.795 (9.48%)    0.884 (10.55%)
191.fma3d    4.204     0.624 (14.85%)   0.699 (16.63%)    3.040    0.380 (12.52%)   0.440 (14.48%)
252.eon      7.389     1.292 (17.48%)   1.374 (18.59%)    5.221    0.742 (14.21%)   0.808 (15.48%)
254.gap      2.818     0.448 (15.89%)   0.488 (17.31%)    2.005    0.261 (13.06%)   0.291 (14.52%)
255.vortex   9.047     1.938 (21.43%)   2.044 (22.59%)    6.345    1.151 (18.14%)   1.217 (19.18%)

Absolute energy values are in millijoules. The numbers within the parentheses under the OPT and IA columns show their energy as a percentage of the base case.

Table XI. Execution Cycles (in Millions) with Different iTLB Configurations (VI-VT)

                      1-entry                                8-entry, FA
Benchmark    Base     OPT              IA                Base     OPT              IA
177.mesa     284.5    230.7 (81.10%)   232.8 (81.82%)    250.5    206.8 (82.53%)   206.9 (82.56%)
186.crafty   510.7    415.4 (81.33%)   421.2 (82.47%)    399.5    378.5 (94.73%)   378.6 (94.78%)
191.fma3d    252.1    207.2 (82.17%)   211.2 (83.76%)    252.1    187.4 (74.33%)   187.8 (74.51%)
252.eon      436.5    340.6 (78.02%)   344.0 (78.80%)    331.7    309.0 (93.16%)   309.8 (93.38%)
254.gap      224.9    188.9 (83.95%)   191.8 (85.22%)    183.7    174.4 (94.95%)   175.5 (95.56%)
255.vortex   499.7    386.2 (77.27%)   389.8 (78.01%)    378.9    352.4 (92.99%)   353.1 (93.20%)

                      16-entry, 2-way                        32-entry, FA
Benchmark    Base     OPT              IA                Base     OPT              IA
177.mesa     237.3    216.8 (91.39%)   218.8 (92.20%)    196.1    189.4 (96.60%)   189.5 (96.64%)
186.crafty   353.1    336.1 (95.18%)   336.3 (95.24%)    350.5    333.5 (95.16%)   333.8 (95.23%)
191.fma3d    188.5    180.6 (95.83%)   180.8 (95.93%)    176.6    170.2 (96.39%)   170.4 (96.48%)
252.eon      308.2    289.8 (94.03%)   290.1 (94.15%)    274.7    264.7 (96.36%)   264.9 (96.43%)
254.gap      175.4    169.2 (96.49%)   169.4 (96.57%)    165.6    161.8 (97.73%)   161.9 (97.77%)
255.vortex   356.7    332.2 (93.12%)   333.7 (93.54%)    310.5    298.7 (96.20%)   299.0 (96.28%)

The numbers within the parentheses under the OPT and IA columns show the corresponding absolute value as a percentage of the base case.

On the average, for the VI-VT cache, the savings in execution time due to IA amount to 18.1%, 11.0%, 5.4%, and 3.55%, respectively, for the 1-entry, 8-entry, 16-entry, and 32-entry iTLBs. Sometimes, even when we miss in iL1, we may be able to find the translation in the CFR with IA, avoiding an iTLB lookup (and its performance and energy penalty) before going to L2. This is mainly because of the larger spatial locality coverage provided by the CFR (which works at a page granularity) compared to the cache-block granularity of iL1. For instance, a reference to a block within a page that is missing in both iL1 and the CFR will cause a miss in IA with VI-VT as well. However, an immediately following cache miss for another block within the same page will hit in the CFR, thus avoiding an iTLB lookup for IA.

Fig. 8. iTLB energy consumptions of two-level iTLB configurations (Left) and their execution cycles (Right). The energy consumptions and execution cycles of 1-entry level-1 and 32-entry, FA level-2 configuration and 32-entry, FA level-1 and 96-entry, FA level-2 configuration are normalized, respectively, with respect to the corresponding values of a 32-entry monolithic iTLB with IA and of a 128-entry monolithic iTLB with IA.

4.3.2 Multilevel iTLB Configurations. A multilevel TLB is not only a way of optimizing TLB performance, but can also be an effective way of reducing power consumption. By satisfying many lookups in a much smaller first-level TLB, we can reduce the dynamic power consumption of the larger second-level TLB (assuming they are looked up sequentially). However, this can typically increase the complexity of the implementation (and the area), and push latencies higher. In fact, on the Itanium, the first-level TLB can be looked up in one cycle, but the larger second-level TLB lookup takes as long as 10 cycles [IM 1999]. To compare how effective our scheme can be in relation to a multilevel iTLB structure, we have conducted numerous experiments with different configurations—(i) a 1-entry level-1 and a 32-entry, FA level-2, and (ii) a 32-entry, FA level-1 and a 96-entry, FA level-2 (as in the dTLB of the IA-64)—both of which have been evaluated with serial (i.e., the second level is looked up only on a first-level miss) and parallel (i.e., both levels are looked up in parallel—this may have some performance benefits in terms of overlapping the lookup latency of the second level) lookups. We are not presenting the results for the parallel lookup here because its energy consumption values are much worse. Here, we compare a monolithic 32-entry, FA iTLB using IA with configuration (i), and a monolithic 128-entry, FA iTLB using IA with configuration (ii). The normalized dynamic energy consumptions and performance cycles are given in Figure 8. To give the multilevel iTLB structure the benefit of the doubt, we have optimistically assumed a single (extra) cycle lookup for the second level when the first level misses.

When we look at the 32-entry experiments, the base execution with a two-level structure consumes 55.3% more energy than a monolithic 32-entry iTLB using IA. This is because in the IA scheme, the energy consumption in the common case (i.e., when the address is present in the CFR) is just the register access/lookup. On the other hand, even with a 1-entry level-1 iTLB, there needs to be a comparison to check whether the translation exists. As a result, the energy difference between these two executions is a consequence of the extra comparison that is involved with a 1-entry level-1 (it is to be noted that whenever we miss in the level-1 base case, we are also not going to find the translation in the CFR for IA in the monolithic configuration).
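For the serial organization, the expected dynamic energy per lookup can be written down directly; the sketch below (with illustrative names: E1 and E2 are the per-access energies of the two levels and m1 the first-level miss ratio) shows where the power advantage comes from: the second level is charged only on a first-level miss.

    /* Serial two-level TLB: level 2 is probed only when level 1 misses,
     * so its per-access energy is weighted by the level-1 miss ratio. */
    double serial_two_level_energy(double E1, double E2, double m1) {
        return E1 + m1 * E2;
    }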

Fig. 9. Left: Normalized iTLB energies with different iL1 configurations (x-axis) using VI-VT. Each bar is normalized with respect to the base case using the same iL1 configuration. The results for the VI-PT case are not given as they do not change much when iL1 configuration is modified. Right: Normalized iTLB energies with different page sizes (x-axis). Each bar is normalized with respect to the base configuration with the corresponding page size. For each page size, the first (OPT, IA) pair corresponds to the VI-PT case whereas the second (OPT, IA) pair corresponds to the VI-VT case.

On the other hand, the performance of the monolithic iTLB with IA does turn out to be a better alternative (by between 2-10%). This is because we do not incur any extra latency looking up the second level when the first level (the CFR) misses. When we have a 1-entry level-1 iTLB, the performance penalties can become a concern, as the additional second-level lookup latency may be incurred often. To offset this, one could increase the number of entries in the first level, as in configuration (ii), to ensure that the working set is captured by the first level. However, the results presented in Figure 8 show that while this optimizes performance, the energy consumption deteriorates significantly. In summary, our IA approach can provide more energy savings than a multilevel iTLB that uses a 1-entry first level, while not suffering from any performance deficiencies (which a multilevel structure can). Its benefits become more significant as the number of entries in the first-level iTLB grows.

Fig. 9. Left: Normalized iTLB energies with different iL1 configurations (x-axis) using VI-VT. Each bar is normalized with respect to the base case using the same iL1 configuration. The results for the VI-PT case are not given as they do not change much when the iL1 configuration is modified. Right: Normalized iTLB energies with different page sizes (x-axis). Each bar is normalized with respect to the base configuration with the corresponding page size. For each page size, the first (OPT, IA) pair corresponds to the VI-PT case whereas the second (OPT, IA) pair corresponds to the VI-VT case.

4.4 Sensitivity to iL1 Configuration

Changing the iL1 configuration can have an effect on the iTLB performance for a VI-VT implementation, while there is not really an influence for a VI-PT implementation. Consequently, we have experimented with different iL1 parameters (8KB, 1-way; 8KB, 4-way; 32KB, 1-way) for a VI-VT scenario. All these caches have a block size of 32 bytes. We give the iTLB energy consumption values with these iL1 configurations using OPT and IA in Figure 9, normalized with respect to their corresponding base cases. We observe from this figure that the benefits of IA are more significant at smaller or less associative iL1 configurations, since these incur more misses (and the iTLB can get in the critical path). As was explained in Section 4.3.1, the CFR may be able to satisfy some of these requests even after an iL1 miss because of its page-level coverage.

Table XII. iTLB Energy Comparison

Benchmark     PI-PT (Base)   PI-PT (IA)   VI-PT (Base)   VI-VT (IA)
177.mesa      104.01         2.48         109.07         3.34
186.crafty    115.24         3.70         124.11         8.38
191.fma3d     104.47         5.23         112.68         3.04
252.eon       115.03         6.77         134.54         5.22
254.gap       104.11         2.83         112.20         2.00
255.vortex    106.00         4.24         108.42         6.34

All values are in millijoules.

4.5 Impact of Page Size

Page size also has a direct bearing on iTLB coverage. We have run experiments with 4KB-, 8KB-, and 16KB-sized pages, and the energy results for OPT and IA normalized with respect to their base executions are given in Figure 9. We observe that the energy reductions get better with larger page sizes. A larger page size not only improves iTLB coverage, but also improves the coverage of the CFR, thus reducing the number of times we need to go to the iTLB. We also see that IA comes close to OPT across the spectrum of page sizes considered.

4.6 Comparison with Low-Power TLBs

There have been a few efforts on designing low-power TLB structures. Some of these efforts focus on circuit-level techniques while the others consider architectural modifications to the TLB. We compare our approach with two techniques. The first approach is circuit based and reduces the original power consumption of a fully associative TLB by 15% [Juan et al. 1997]. The second technique, by Choi et al. [2002], proposes a multi-banked TLB organization where each bank is half the size of the original TLB. Even if we optimistically assume that there is no performance impact from this reorganization and that the energy consumption per access gets cut by half, this strategy can at best lower iTLB energy by 50%. On the other hand, the results for IA in Figure 6 show that we can get over 85% energy savings with our approach, making it a better optimization alternative. We are able to achieve this without any performance degradation. Further, as was mentioned earlier, we can employ our technique in conjunction with these low-power structures to further their energy savings. It can also be used in conjunction with the dynamic TLB reorganization techniques that are presented in Balasubramonian et al. [2000].

4.7 PI-PT iL1 Lookup

This form of iL1 addressing is not really in fashion because of the additional latency in the critical path of the iTLB lookup before iL1 is accessed (as mentioned earlier, there are some ways of getting around this if the iL1 indexing can be done with just the page offset bits, in which case it is no different from VI-PT, though it restricts iL1 configurations). However, our approach can also be used in conjunction with a PI-PT iL1, as long as we can provide translations most of the time. To investigate this issue, we have conducted experiments with a PI-PT iL1 cache, and the energy and performance results are presented in Tables XII

Table XIII. Execution Cycle Comparison

Benchmark     PI-PT (Base)   PI-PT (IA)   VI-PT (Base)   VI-VT (IA)
177.mesa      250.6          195.5        188.1          196.1
186.crafty    410.4          343.7        331.7          350.5
191.fma3d     241.6          189.8        169.3          176.6
252.eon       330.4          282.9        263.1          274.7
254.gap       214.7          167.6        161.3          165.6
255.vortex    360.9          308.6        293.9          310.5

All values are in millions.

and XIII, respectively. These tables compare (i) base PI-PT, (ii) PI-PT with IA, (iii) base VI-PT, and (iv) base VI-VT. All experiments have been performed with our default configuration parameters. As can be expected, the base PI-PT does much worse than (iii) or (iv) in terms of execution cycles, while consuming as much energy as (iii). However, we find that incorporating IA into PI-PT substantially lowers the execution cycles, bringing its performance within 5.7% of the base VI-PT on the average, while doing significantly better than it in terms of energy. IA with PI-PT comes even closer to the base VI-VT in terms of cycles (even beating it in three of our six applications). At the same time, it expends less energy than the base VI-VT in three applications. These results suggest that PI-PT (which is largely ignored today) may not be a bad idea at all for iL1 addressing when used in conjunction with our optimizations. Even though we model the iTLB lookup to take a cycle in a system like PI-PT (i.e., this cycle is incurred before an iL1 lookup), note that this is quite pessimistic. Typically, this can be done much faster. In a VI-PT scheme that hits in the iTLB, we assume the physical address is available from the iTLB by the time the blocks of a set are selected in the iL1. The subsequent tag comparison for the ways in this set can thus be performed using physical addresses, and consequently, the cache hit can be completed in 1 cycle. Note that we are modeling it this way even in the default VI-PT (i.e., even without our enhancements).

4.8 Impact of High iL1 Misses

A VI-VT cache automatically lowers the iTLB energy considerably, requiring a lookup only on a cache miss. However, the iTLB on a VI-VT iL1 cache is highly sensitive to iL1 misses. The higher the iL1 misses, the worse its performance and energy costs. It has been reported [Ailamaki et al. 1999] that iL1 misses can turn out to be quite substantial, accounting for up to 40% of memory stall latencies, in some memory-resident database applications. In such situations, our approach can turn out to be a better energy saving solution (and maybe even a performance optimization) when used in conjunction with a VI-PT cache. Of course, it can also be used with the VI-VT cache itself for performance and energy improvements, as was pointed out in earlier results. To investigate this issue, we have conducted experiments with small iL1 configurations which accentuate the misses in these applications (though we would have liked to run this study with real database workloads, we have not yet been able to get them working with SimpleScalar for a homogeneous

Table XIV. iTLB Energy Comparison with 512-Byte iL1

Benchmark     VI-PT (Base)   VI-PT (IA)   VI-VT (Base)   VI-VT (IA)
177.mesa      115.10         2.49         14.53          1.60
186.crafty    121.38         3.58         17.18          2.34
191.fma3d     112.54         5.33         11.39          3.22
252.eon       120.11         6.82         15.48          3.79
254.gap       114.46         2.78         12.69          1.94
255.vortex    117.27         4.21         16.61          3.52

All values are in millijoules.

Table XV. Execution Cycle Comparison with 512-Byte iL1

Benchmark     VI-PT (Base)   VI-PT (IA)   VI-VT (Base)   VI-VT (IA)
177.mesa      452.4          452.4        488.8          456.8
186.crafty    526.7          526.7        566.5          532.1
191.fma3d     362.9          362.9        390.2          370.7
252.eon       472.7          472.7        507.1          480.9
254.gap       401.3          401.3        431.6          405.7
255.vortex    532.3          532.3        576.7          544.6

All values are in millions.

comparison). With a 512-byte, 2-way iL1, the miss rates for mesa, crafty, fma3d, eon, gap, and vortex become 12.60%, 14.17%, 10.12%, 12.89%, 11.09% and 14.17%, respectively. The iTLB energy/performance results with this small iL1 for (i) base VI-PT, (ii) VI-PT with IA, (iii) base VI-VT, and (iv) VI-VT with IA are given in Tables XIV and XV. We can see that the performance penalty of VI-VT starts hurting with such a small iL1, with execution time being extended by 7.73% on the average over VI-PT. However, we find that IA with VI-PT can provide even lower energy consumption than VI-VT, while not suffering from this performance deficiency. Augmenting VI-VT with IA is another way of reducing this overhead, as can be seen in the results.

4.9 Discussion

Since we add a new component (the CFR) to the architecture, one may wonder what its impact would be on leakage energy consumption, which is a growing concern with scaling technology. While this component certainly increases the leakage consumption, we do not expect this to be significant in practice since this structure is very small (just one register). In addition, as has been discussed in the article, in some cases our schemes actually bring performance benefits, which will help reduce overall leakage consumption. In fact, one can expect this benefit to be more pronounced than the small leakage increase due to the CFR. Moreover, our approach can help further reduce leakage in the following way. Since we are increasing the iTLB inter-access times, one can employ energy-saving techniques for reducing iTLB leakage. For example, one can shut down iTLB entries [Delaluz et al. 2003], relying on the fact that most address translation requests are satisfied by the CFR, so that an iTLB access is not needed. To sum up, considering these factors, we believe that the proposed schemes do not worsen the leakage behavior of the system.

Another important issue is the choice of the branch prediction logic employed in this study. Recall that the predictor we use in this study is a modest one based on a 2-bit up-down counter. A more sophisticated branch predictor, such as the hybrid predictors that are in use in commercial architectures, can further improve the energy savings provided by our IA scheme. This is because a better predictor would provide more accurate information, allowing IA to be more precise. However, a detailed study of this issue is beyond the scope of this article. Yet another issue is the impact of borrowing an extra bit from the address to indicate whether the CFR or the iTLB needs to be used for a particular address translation. While this can potentially prevent implementing some long jumps, in our experiments we never experienced this problem. Since the address space provided by 64 bits is too large to use completely, we do not expect this to be an issue in the near future. The next issue that we want to discuss in this section is the impact of the extra instructions inserted in the code to update CFR contents. It should be mentioned that this number is very small in general, and its impact on overall code size was less than 2% for IA across all applications we experimented with. For the same reason, the impact of these instructions on the instruction cache energy consumption was found to be very small (less than 1% across all applications in our experimental suite). Finally, one might wonder about the potential implications when DLLs (dynamic linking) are used. It should be noticed that, in this case, DLLs will be on a separate page (without any application code instructions on the same page). A call (jump) to this code from the application would be treated the same way as a branch (as explained earlier in the article). Once we are in the DLL code, we can treat the DLL itself as application code (that has already been compiled and loaded); that is, when compiling the sources for this library, the compiler detects page boundaries to put markers, and the branches work as explained earlier in the article.

5. SUMMARY OF RESULTS AND CONCLUDING REMARKS

One cannot afford to ignore any component when optimizing power consumption within a chip. We need to investigate optimizations for each component, regardless of what its contribution to the overall power consumption may be. Power reductions in one component may shift the bottleneck to another. At the same time, power density (dissipation per unit area) can become an even more important consideration for cooling, packaging, and reliability. This is particularly important for small structures such as the iTLB, which is very frequently accessed. Consequently, this article has proposed hardware and software mechanisms for dynamic power optimizations within the iTLB. These mechanisms are intended to reduce the number of times that the iTLB is accessed, and can also work very well in conjunction with other circuit/architectural techniques for furthering the power savings. Of the different techniques that were proposed and evaluated, the IA strategy, which uses compiler analysis to track page boundary crossings, and a simple piece of hardware in conjunction with a branch predictor for branches out of

a page, can effectively cut energy consumption by over 85%. It works well on both VI-PT and VI-VT iL1 caches. At the same time, these mechanisms are different from keeping a two-level iTLB (with the first level being 1 entry). In such a structure, a comparison is still needed to find out whether the translation exists, while three of our mechanisms (IA, SoCA, and SoLA) are already sure of this, leading to less energy consumption. Some of the detailed observations and contributions of this work are the following:

— Our optimization mechanisms achieve significant iTLB power savings without compromising on performance. Their importance grows with higher iL1 miss rates (as in database applications) and larger page sizes (which is a trend these days). They can work very well with large iTLB structures (that can possibly consume more power and take longer to look up), without those structures being on the common-case path.

— These solutions are also very effective in removing the iTLB from the critical path of a VI-VT lookup mechanism, and can thus turn out to cut execution cycles as well in such cases.

— While a VI-VT mechanism can automatically provide good iTLB power savings over VI-PT, its drawback is possible performance degradation with higher iL1 miss rates. At the same time, there are some drawbacks with VI-VT (even if we are to avoid cache aliasing with extra bits), since write-backs need to work with physical addresses—consequently, some VI-VT mechanisms keep both physical and virtual tags with each cache line to handle write-backs [SA; Jacob and Mudge 2001]. Our mechanisms, on the other hand, can take VI-PT and provide as good power savings as VI-VT (if not better) without incurring any performance degradation. Further, they can take VI-VT and improve its performance (depending on how VI-VT is implemented) to approach that of VI-PT while furthering its power savings. Our contributions thus make it possible to remove the iTLB power consumption from being an issue for iL1 design (indexing/lookup strategy).

— We have even ventured further to examine the ramifications with a PI-PT iL1, which is largely ignored today (except with very specific iL1 configurations). We have shown that our mechanisms can reduce the performance penalty with this kind of iL1 addressing considerably, making it competitive with a VI-PT iL1. Further, VI-PT and VI-VT iL1 caches require translations (storing physical addresses within each cache block) for write-backs, increasing the iL1 complexity and power dissipation. On the other hand, PI-PT does not require this, and with our scheme we can provide the performance and power consumption of these fancier cache indexing mechanisms without the drawbacks.

— This work can be viewed as taking another step in the direction, investigated in Jacob and Mudge [2001], of removing the TLB altogether. We are now less dependent on the actual iTLB structure in terms of its lookup latency. From the hardware point of view, this strategy can save on-chip area, in addition to optimizing power consumption and power density.

Finally, it is to be emphasized that the dynamic energy savings with our mechanisms are mainly a consequence of the reduced number of iTLB accesses, and the percentage improvements are likely to hold with technology or circuit-level improvements. Having identified the potential of this different philosophy in generating physical addresses for the instruction stream, we are currently examining similar approaches for data references. This is particularly important in certain multiported dTLBs, where memcopy operations require constant load and store lookups, leading to significant power consumption. We are also looking to perform code layout transformations, and data/code restructuring, to benefit from the reuse of the translation within the CFR. Finally, we plan to see how converting branch instructions to predicated ones would impact the behavior of the proposed schemes.

REFERENCES

AILAMAKI, A., DEWITT, D. J., HILL, M. D., AND WOOD, D. A. 1999. DBMSs on a modern processor: Where does time go? In Proceedings of the 25th International Conference on Very Large Data Bases (Edinburgh, UK).
BALASUBRAMONIAN, R., ALBONESI, D., BUYUKTOSUNOGLU, A., AND DWARKADAS, S. 2000. Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures. In Proceedings of the 33rd International Symposium on Microarchitecture. 245–257.
BROOKS, D. AND MARTONOSI, M. 2001. Dynamic thermal management for high-performance microprocessors. In Proceedings of the International Symposium on High-Performance Computer Architecture.
BROOKS, D., TIWARI, V., AND MARTONOSI, M. 2000. Wattch: A framework for architectural-level power analysis and optimizations. In Proceedings of the 27th International Symposium on Computer Architecture (Vancouver, British Columbia, Canada).
BURGER, D., AUSTIN, T., AND BENNETT, S. 1996. Evaluating future microprocessors: The SimpleScalar tool set. Tech. Rep. CS-TR-96-103, Computer Science Department, University of Wisconsin, Madison, Wisc., July.
CATTHOOR, F., WUYTACK, S., GREEF, E. D., BALASA, F., NACHTERGAELE, L., AND VANDECAPPELLE, A. 1998. Custom Memory Management Methodology—Exploration of Memory Organization for Embedded Multimedia System Design. Kluwer Academic Publishers.
CEKLEOV, M. AND DUBOIS, M. 1997. Virtual-address caches. Part 1: Problems and solutions in uniprocessors. IEEE Micro 17, 5 (Sept.), 64–71.
CHIUEH, T.-C. AND KATZ, R. H. 1992. Eliminating the address translation bottleneck for physical address cache. In ASPLOS. 137–148.
CHOI, J.-H., LEE, J.-H., JEONG, S.-W., KIM, S.-D., AND WEEMS, C. 2002. A low-power TLB structure for embedded systems. IEEE Comput. Archit. Lett. 1.
DELALUZ, V., KANDEMIR, M., SIVASUBRAMANIAM, A., IRWIN, M. J., AND VIJAYKRISHNAN, N. 2003. Reducing DTLB energy through dynamic resizing. In Proceedings of the International Conference on Computer Design (San Jose, Calif.).
FOLEGNANI, D. AND GONZALEZ, A. 2001. Energy-effective issue logic. In Proceedings of the 28th International Symposium on Computer Architecture (Goteborg, Sweden).
GHOSE, K. AND KAMBLE, M. B. 1998a. Energy efficient cache organizations for superscalar processors. In Proceedings of the Power Driven Microarchitecture Workshop of the 25th International Symposium on Computer Architecture (Barcelona, Spain).
GHOSE, K. AND KAMBLE, M. B. 1998b. Reducing power in superscalar processor caches using subbanking, multiple line buffers and bit-line segmentation. In Proceedings of the International Symposium on Low Power Electronics and Design (San Diego, Calif.).
GHOSE, K. AND KAMBLE, M. B. 1999. Reducing power in superscalar processor caches using subbanking, multiple line buffers, and bit-line segmentation. In Proceedings of the International Symposium on Low Power Electronics and Design. 70–75.

HENNING, J. L. 2000. SPEC2000: Measuring CPU performance in the new millennium. IEEE Comput. Mag. 28–35.
IM. 1999. Itanium manual. http://developer.intel.com/design/itanium/manuals.htm.
INOUE, K., ISHIHARA, T., AND MURAKAMI, K. 1999. Way-predicting set-associative cache for high performance and low energy consumption. In Proceedings of the International Symposium on Low Power Electronics and Design. 273–275.
JACOB, B. AND MUDGE, T. 2001. Uniprocessor virtual memory without TLBs. IEEE Trans. Comput. 50, 5 (May), 482–49.
JUAN, T., LANG, T., AND NAVARRO, J. J. 1997. Reducing TLB power requirements. In Proceedings of the International Symposium on Low Power Electronics and Design.
KANDEMIR, M., RAMANUJAM, J., AND SEZER, U. 2001. Compiler support for block buffering. In Proceedings of the International Symposium on Low Power Electronics and Design (Huntington Beach, Calif.).
KIM, S. 2001. Low power MMU design for embedded processors. http://supercom.yonsei.ac.kr/temp/sam.ppt.
KNIGHT, J. AND ROSENFELD, P. 1984. Segmented virtual to real translation assist. IBM Tech. Disc. Bull. 27, 2 (July), 1077–1078.
LEE, H.-H. S. AND BALLAPURAM, C. S. 2003. Energy efficient D-TLB and data cache using semantic-aware multilateral partitioning. In Proceedings of the International Symposium on Low Power Electronics and Design (Seoul, Korea).
LEE, J.-H., PARK, G.-H., PARK, S.-B., AND KIM, S.-D. 2003. A selective filter-bank TLB system. In Proceedings of the International Symposium on Low Power Electronics and Design (Seoul, Korea).
MADDOCK, R., MARKS, B., MINSHULL, J., AND PINNELL, M. 1981. Hardware address relocation for variable length segments. IBM Tech. Disc. Bull. 23, 11 (Apr.), 5186–5187.
MANNE, S., KLAUSER, A., GRUNWALD, D., AND SOMENZI, F. 1997. Low-power TLB design for high performance microprocessors. Tech. Rep., Boulder, Colo.
PANWAR, R. AND RENNELS, D. 1995. Reducing the frequency of tag compares for low-power I-cache design. In Proceedings of the International Symposium on Low Power Electronics and Design (Dana Point, Calif.).
PARIKH, D., SKADRON, K., ZHANG, Y., BARCELLA, M., AND STAN, M. 2002. Power issues related to branch prediction. In Proceedings of the 8th International Symposium on High-Performance Computer Architecture. 233–244.
REINMAN, G. AND JOUPPI, N. P. 2000. CACTI 2.0: An integrated cache timing and power model. Tech. Rep. 2000/7, Compaq, February.
SA. 2002. Intel StrongARM processor. http://www.intel.com/design/pca/applicationsprocessors/1110 brf.htm.
SIMA, D., FOUNTAIN, T., AND KACSUK, P. 1997. Advanced Computer Architecture: A Design Space Approach. Addison-Wesley, Reading, Mass.
STRECKER, W. D. 1978. VAX-11/780: A virtual address extension to the DEC PDP-11 family. In AFIPS NCC. Vol. 47.
VIJAYKRISHNAN, N., KANDEMIR, M., KIM, H. Y., IRWIN, M. J., AND YE, W. 2000. Energy-driven integrated hardware-software optimizations using SimplePower. In Proceedings of the International Symposium on Computer Architecture.
YANG, J., ZHANG, Y., AND GUPTA, R. 2000. Frequent value compression in data caches. In Proceedings of the 33rd International Symposium on Microarchitecture (Monterey, Calif.). 258–265.
ZYUBAN, V. AND KOGGE, P. 1998. Split register file architectures for inherently lower power microprocessors. In Proceedings of the Power-Driven Microarchitecture Workshop (in conjunction with ISCA'98). 32–37.

Received October 2003; revised February 2004; accepted March 2004


Efficient Techniques for Transition Testing

XIAO LIU and MICHAEL S. HSIAO
Virginia Tech
and
SREEJIT CHAKRAVARTY and PAUL J. THADIKARAN
Intel Corporation

Scan-based transition tests are added to improve the detection of speed failures in sequential circuits. Empirical data suggests that both data volume and application time will increase dramatically for such transition testing. Techniques to address the above problem for a class of transition tests, called enhanced transition tests, are proposed in this article. The first technique, which combines the proposed transition test chains with the ATE repeat capability, reduces test data volume by 46.5% when compared with transition tests computed by a commercial transition test ATPG tool. However, the test application time may sometimes increase. To address the test time issue, a new DFT technique, Exchange Scan, is proposed. Exchange scan reduces both data volume and application time by 46.5%. These techniques rely on the use of hold-scan cells and highlight the effectiveness of hold-scan design to address test time and test data volume issues. In addition, we address the problem of yield loss due to incidental overtesting of functionally-untestable transition faults, and we formulate an efficient adjustment to the algorithm to keep the overtest ratio low. Our experimental results show that up to 14.5% reduction in overtest ratio can be achieved, with an average overtest reduction of 4.68%.

Categories and Subject Descriptors: B.8.1 [Performance and Reliability]: Reliability, Testing, and Fault-Tolerance; I.1.2 [Symbolic and Algebraic Manipulation]: Algorithms; J.6 [Computer-Aided Engineering]

General Terms: Algorithms, Reliability

Additional Key Words and Phrases: Test application time reduction, test chain, test data volume reduction, transition faults, yield loss

This research was supported in part by a grant from Intel Corp., and in part by NSF under contract CCR-0196470. A preliminary version of this work appeared in ITC 2002.
Authors' addresses: X. Liu, 8215 Southwestern Blvd., #1015, Dallas, TX 75206; email: [email protected].

1. INTRODUCTION

Higher clock rates, shrinking geometries, increasing metal density, and so forth are resulting in delay defects which cause speed failures. Delay defects in a circuit do not change the logic behavior of the circuit, but affect the speed at


which the circuit can operate. Although conventional stuck-at (s@) testing can uncover some delay defects, the stuck-at fault model [Eldred 1959] does not model speed-related failures very well. This has prompted researchers to propose a variety of fault models for speed failures, that is, the transition fault [Waicukauski et al. 1987], the path delay fault [Smith 1985], and the segment delay fault [Heragu et al. 1996]. A transition fault at node X assumes a large delay at X such that the transition at X will not reach the latch or primary output within the clock period. The path delay fault model considers the cumulative effect of gate delays along a specific path, from a primary input to a primary output. If the cumulative delay exceeds the slack for the path, then the chip fails. Segment-delay faults target path segments instead of complete paths. Among these models, the transition fault model is the most practical, and commercial tools are available for computing such tests. The number of transition faults is linear in the number of circuit lines, while the number of path delay faults can be exponential in it, which makes a critical path analysis and identification procedure a necessity. Transition tests have been generated to improve the detection of speed failures in microprocessors [Tendulkar et al. 2002] as well as ASICs [Hsu et al. 2001]. At each line in the circuit, two transition faults are possible: slow-to-rise and slow-to-fall. A test pattern for a transition fault consists of a pair of vectors {V1, V2}, where V1 (initial vector) is required to set the target node to an initial value, and V2 (test vector) is required to launch the appropriate transition at the target node and also propagate the fault effect to a primary output [Waicukauski et al. 1987; Savir and Patil 1993]. Transition tests can be applied in three different ways: Broadside [Savir and Patil 1994], Skewed-Load [Savir and Patil 1994], and Enhanced-Scan [Dervisoglu and Stong 1991]. For broadside testing (also called functional justification), a vector is scanned in and the functional clock is pulsed to create the transition and subsequently capture the response. For each pattern in broadside testing, only one vector is stored in tester scan memory. The second vector is derived from the first by pulsing the functional clock. For skewed-load transition testing, an N-bit vector is loaded by shifting in the first N-1 bits, where N is the scan chain length. The last shift clock is used to launch the transition. This is followed by a quick capture. For skewed-load testing also, only one vector is stored for each transition pattern in tester scan memory; the first vector is a shifted version of the stored vector. Finally, for enhanced scan transition testing, two vectors (V1, V2) are stored in the tester scan memory. The first scan shift loads V1 into the scan chain. It is then applied to the circuit under test to initialize it. Next, V2 is scanned in, followed by an apply, and subsequently a capture of the response. During the shifting in of V2, it is assumed that the initialization performed by V1 is not destroyed. Therefore, enhanced scan transition testing assumes a hold-scan design [Dervisoglu and Stong 1991]. Among the three kinds of transition tests, broadside suffers from poor fault coverage [Savir and Patil 1994]. Since there is no dependency between the two vectors in enhanced scan, it can give better coverage than the skewed-load transition test.
Skewed-load transition tests also lead to larger test data volume.
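The dependency that limits skewed-load coverage can be made explicit: the launch vector is necessarily a one-bit shift of the initial vector, whereas enhanced scan stores both vectors independently. A minimal sketch, with hypothetical vectors and an assumed shift direction and fill bit:

```python
# Skewed-load: only one N-bit vector is stored. After N-1 shift cycles the
# chain holds V1; the Nth (launch) shift completes the load, so V2 is the
# stored vector and V1 is V2 shifted by one scan position. The shift
# direction and fill bit here are assumptions for illustration.

def skewed_load_pair(stored_vector, fill_bit=0):
    v2 = list(stored_vector)
    v1 = [fill_bit] + v2[:-1]      # V1 and V2 differ by a single shift
    return v1, v2

v1, v2 = skewed_load_pair([1, 0, 1, 1])
print(v1, v2)   # [0, 1, 0, 1] [1, 0, 1, 1] -- the pair cannot be arbitrary
```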

Table I. Test Data Volume Comparison

Name     S@ Test   Trans Test   Expansion
c3540    179       366          4.09
c5315    124       248          4
c6288    36        80           5
c7552    237       378          3.19
s13207   485       1002         4.13
s15850   464       924          3.98
s35932   75        137          3.65
s38417   1017      1927         3.79
s38584   727       1361         3.74

Compared to stuck-at tests, the increase in the number of vectors required for enhanced scan to get complete coverage is about 4X (Table I). This data, collected using a state-of-the-art commercial ATPG tool, shows the number of stuck-at vectors and the number of enhanced scan transition patterns for each circuit. Note that each transition pattern consists of two vectors. For the skewed-load transition test, it has been observed that the data volume for an ASIC increases by 5.9X [Hsu et al. 2001]. A drawback of enhanced scan testing is that it requires hold-scan cells. However, in microprocessors and other high-performance circuits that require custom design, such cells are used for other reasons. In custom designs, the circuit often is not fully decoded, and hold-scan cells are used to prevent contention in the data being shifted, as well as to prevent excessive power dissipation in the circuit during the scan shift phase. Furthermore, if hold-scan cells are used, the failing parts in which only the scan logic failed can often be retrieved, thus enhancing, to some extent, the diagnostic capability associated with scan DFT. In this article, we will consider only enhanced scan transition tests. A number of techniques for reducing test data volume and test application time for single-cycle scan tests have been presented in the literature [Keller et al. 2001; Hamzaoglu and Patel 1999; Koeneman 1991; Chandra and Chakrabarty 2001; Lee et al. 1998; Das and Touba 2000]. These methods assume that only 5–10% of the bits are fully specified. Unspecified bits are filled to identify the easy-to-detect faults. Different codes to compress the information in the tester and decompress it on chip [Keller et al. 2001; Chandra and Chakrabarty 2001; Das and Touba 2000], or partitioning schemes that apply similar patterns to the different partitions [Hamzaoglu and Patel 1999; Lee et al. 1998], have been proposed. The techniques proposed here complement this work on compressing individual vectors. In our previous work [Liu et al. 2003], we presented two algorithms for computing transition test patterns from generated s@ test vectors which can reduce the test set size by 20%. Nevertheless, there has not been much work on addressing the explosion in data and application time for transition tests. To tackle this problem, we first propose novel techniques to generate effective transition test chains based on stuck-at vectors, so that the need for a separate transition ATPG is eliminated. Next, we propose methods to reduce the transition test


data volume and test application time. Our optimization techniques consider factors across test patterns for transition tests. The first technique uses the ATE repeat option to reduce the test data volume and requires the use of transition test chains, rather than the random application of the individual transition test patterns. We describe the transition test chains and present a novel algorithm for computing such chains. Experimental results show an average data volume reduction of 46.5%, compared to the conventional enhanced scan transition test data volume computed by COM, a commercial ATPG tool. This technique does not necessarily decrease test application time. To reduce test application time, a new DFT technique called exchange scan is proposed. Combining exchange scan with transition test chains reduces both the test application time and test data volume by 46.5%, compared to a conventional transition test set computed by COM. Nevertheless, one of the drawbacks of scan-based delay tests is possible yield loss due to overtesting. In Rearick [2001], the author showed that scan-based testing may fail a chip due to delay faults that do not affect the normal operation, and thus it is unnecessary to target those functionally unsensitizable faults. In Lai et al. [2000a, 2000b], the authors addressed the impact of delay defects on the functionally untestable paths on the circuit performance. Moreover, a scan test pattern, though derived from targeting functionally testable transition faults, can incidentally detect functionally untestable faults if the starting state is an arbitrary state (which could be an unreachable state). In this article, we use a low-cost implication engine to compute the functionally untestable faults and address the problem of overtesting in our graph-formulated transition ATPG algorithm. The rest of the article is organized as follows. In Section 2, the ATE model we use is described. Section 3 proposes a novel transition test chain formulation which is mapped into a weighted transition-pattern-graph traversal problem. A new DFT technique to reduce test application time by reducing the number of scan loads is presented in Section 4. Section 5 describes a novel method to address the problem of incidental overtesting of functionally untestable faults in our ATPG algorithm. Section 6 presents a summary of all the experimental results. Finally, Section 7 concludes the article.

2. ATE MODEL

Figure 1 is an abstraction of the tester model we use. ATE storage consists of two parts: scan and control memory. Scan memory is divided into several channels. Each channel consists of three bits. For each clock cycle of the scan shift operation, the scan memory contains the bit to be scanned in, the expected response bit from the circuit under test (CUT), and an indication as to whether or not this bit of the response is to be masked, indicated by M or U in Figure 2. Figure 2(a) shows the data stored for a single scan channel for the test set {V1, V2, V3, V4, V5, V6, V7, V8, V9, V10}, with Rj the expected response for Vj. The control memory controls the shift and the comparison operation. The scan memory depth required is (N+1)*S bits for a test set of size N and scan length S.


Fig. 1. Tester memory model.

Fig. 2. ATE storage model.

The enhanced scan transition test set in Table II consists of 6 transition test patterns. Each pattern consists of a pair of vectors and the expected response to the test vector. The first pattern in our example consists of the pair (V1, V2) and the response R2 to V2. Storage of the test data in the scan memory is shown in Figure 2(b). The storage depth required is N*2*S+N*S bits. The control sequence for this test set, shown in Table III, is very repetitive and is stored in the tester scan memory. Row 1 of the table states that V1 is scanned in and applied to the CUT. Row 2 states that V2 is scanned in, applied to the CUT, and the response R2 captured. Row 3 states that the first vector of the next vector pair, that is V2, is scanned in, while the response R2 of the previous test pattern is scanned out and compared against the expected response. Once the scan operation is complete, the new vector is applied to the CUT. The rest of the entries can be similarly interpreted.
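A quick calculation makes the storage comparison concrete. The sketch below simply evaluates the two depth expressions above, for a hypothetical test-set size and scan length:

```python
# Scan-memory depth for the two storage schemes described above.

def stuck_at_depth(n, s):
    # N vectors plus one final response: (N+1)*S bits.
    return (n + 1) * s

def enhanced_scan_depth(n, s):
    # N enhanced-scan patterns: a vector pair (2*S bits) plus an
    # expected response (S bits) per pattern: N*2*S + N*S bits.
    return n * 2 * s + n * s

n, s = 1000, 500    # hypothetical test-set size and scan-chain length
print(stuck_at_depth(n, s))       # 500500
print(enhanced_scan_depth(n, s))  # 1500000, roughly 3x the stuck-at storage
```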


Table II. Enhanced Scan Transition Test Set Example

Vector1   Vector2   Response
V1        V2        R2
V2        V3        R3
V3        V4        R4
V3        V5        R5
V1        V3        R3*
V4        V3        R3**

Table III. Control Sequence of Transition Test

(i) Shift in V1; (ii) Apply;
(i) Shift in V2; (ii) Apply; (iii) Capture;
(i) Shift in V2; Shift Out and Compare R2; (ii) Apply;
(i) Shift in V3; (ii) Apply; (iii) Capture;
(i) Shift in V3; Shift Out and Compare R3; (ii) Apply;
(i) Shift in V4; (ii) Apply; (iii) Capture;
(i) Shift in V3; Shift Out and Compare R4; (ii) Apply;
(i) Shift in V5; (ii) Apply; (iii) Capture;
(i) Shift in V1; Shift Out and Compare R5; (ii) Apply;
(i) Shift in V3; (ii) Apply; (iii) Capture;
(i) Shift in V4; Shift Out and Compare R3*; (ii) Apply;
(i) Shift in V3; (ii) Apply; (iii) Capture;
(i) Shift Out and Compare R3**;

3. ATE REPEAT AND TRANSITION TEST CHAINS

3.1 ATE Repeat

There is considerable redundancy in the information stored in the tester. In Figure 2(b), V2 and V3 are used several times in the test sequence. Ideally, one copy of the information should suffice. However, storing one copy of a vector and reusing it in any random order requires the ATE to index into random locations in the scan memory, an operation which is currently not available. Limited reuse of the information stored, however, is possible. In Figure 2(b), two copies of V2 are stored in consecutive locations in the scan memory. It is possible to store just one copy of V2 and scan in V2 as often as needed during consecutive scan cycles. Similarly, we can replace the two copies of V3 in locations 4 and 5, from the left side of the scan memory in Figure 2(b), with just one copy of V3. Further reduction of the number of copies of V3 is not possible. Thus, we store the sequence {V1, V2*, V3*, V4, V3, V5, V1, V3} and repeatedly scan in the vectors marked with the flag *. Information about vectors that need to be scanned in multiple times is stored in control memory. Thus, 8 instead of 10 vectors need to be stored. In the preceding example, the scan storage requirement was reduced at a price. Since vectors that are scanned in repeatedly do not form a regular pattern, the control memory requirement can increase drastically. To avoid such an increase in control memory, we impose a restriction that, except for the first vector and last vector stored in the scan memory, every vector is scanned in


exactly twice. In Figure 2(c), the sequence {V1, V2, V3, V4, V1, V3, V5} is stored. Assuming that all but the first and last vectors are scanned in twice, the set of transition test patterns applied is (V1, V2), (V2, V3), (V3, V4), (V4, V1), (V1, V3), (V3, V5). This set of patterns includes all the test patterns of Figure 2(a) as well as (V4, V1). Thus, by storing only 7 vectors (instead of 10 vectors), all the transition tests can be applied. For this example, not only is the control memory requirement lower, but the number of vectors stored is also reduced. Sequences in which all but the first and last vectors are scanned in twice are defined as transition test chains.

3.2 Transition Test Chains via Weighted Transition Graph

Computing transition test chains is different from computing a set of transition test patterns as is conventionally done. A novel algorithm, called the weighted transition graph algorithm, to compute such chains is discussed next. The algorithm constructs transition test chains from a given stuck-at test set. Instead of computing a set of vector pairs and chaining them together as alluded to in the preceding examples, the problem is mapped into a graph traversal problem. The algorithm builds a weighted directed graph called the weighted transition-pattern graph. In this graph, each node represents a vector in the stuck-at test set; a directed edge from node Vi to Vj denotes the transition test pattern (Vi, Vj), and its weight indicates the number of transition faults detected by (Vi, Vj). Directed edges may potentially exist between every node pair, resulting in a large (possibly complete) graph. In order to reduce the time required to construct the graph, only the subset of the faults missed by the original stuck-at test set is considered. The graph construction procedure is as follows. We start with the stuck-at test set T = {T1 . . . TN}.

(1) Perform transition fault simulation using the stuck-at test set as a transition test set {(T1, T2), (T2, T3), . . . , (TN−1, TN)} to compute undet, the set of undetected transition faults.
(2) Deduce the subset U of the stuck-at faults implied by undet as follows. If the X slow-to-rise or slow-to-fall fault ∈ undet, then both X stuck-at-0 and X stuck-at-1 are included in U.
(3) Perform stuck-at fault simulation, without fault dropping, using the stuck-at test set T on the stuck-at faults in U. For each stuck-at fault f in U, record the vectors in T that excite f and the vectors that detect f. Also, for each vector, the faults excited and detected by it are recorded.
(4) The weighted directed graph contains a node corresponding to each of the stuck-at tests in T. The directed edge from Vi to Vj is inserted if the corresponding test pattern (Ti, Tj) detects at least one transition fault in undet. The weight of (Ti, Tj) is the number of transition faults in undet detected by (Ti, Tj).
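A minimal sketch of this construction follows. It builds the edges by crossing the excitation and detection dictionaries of step (3): a pattern (Vi, Vj) detects slow-to-fall at x when Vi excites x stuck-at-0 (sets x to 1) and Vj detects x stuck-at-1, and symmetrically for slow-to-rise. The dictionary encoding is our own simplification in the style of Table IV below, and this cross-product only approximates true fault simulation, so its weights need not match Figure 4 exactly.

```python
from itertools import permutations

# Excitation/detection dictionaries in the style of Table IV (below).
excites = {
    "V1": {"a-s-0", "b-s-1", "c-s-1", "d-s-0", "e-s-0"},
    "V2": {"b-s-1", "c-s-0", "d-s-0", "e-s-1"},
    "V3": {"a-s-1", "c-s-1"},
    "V4": {"a-s-1", "b-s-0", "d-s-1", "e-s-0"},
}
detects = {
    "V1": {"a-s-0", "b-s-1"},
    "V2": {"c-s-0", "d-s-0", "e-s-1"},
    "V3": {"a-s-1", "c-s-1"},
    "V4": {"b-s-0", "d-s-1", "e-s-0"},
}

def faults_of(vi, vj):
    # (Vi, Vj) detects x slow-to-fall if Vi excites x-s-0 (sets x to 1)
    # and Vj detects x-s-1; slow-to-rise is the symmetric case.
    found = set()
    for x in "abcde":
        if f"{x}-s-0" in excites[vi] and f"{x}-s-1" in detects[vj]:
            found.add(f"{x}-slow-to-fall")
        if f"{x}-s-1" in excites[vi] and f"{x}-s-0" in detects[vj]:
            found.add(f"{x}-slow-to-rise")
    return found

def build_graph(vectors, undet):
    # Step (4): keep an edge only if it detects at least one fault still
    # in undet; the edge weight is the number of such faults.
    graph = {}
    for vi, vj in permutations(vectors, 2):
        hits = faults_of(vi, vj) & undet
        if hits:
            graph[(vi, vj)] = len(hits)
    return graph

undet = {"a-slow-to-rise", "a-slow-to-fall", "b-slow-to-rise",
         "b-slow-to-fall", "d-slow-to-rise", "d-slow-to-fall",
         "e-slow-to-rise"}
print(build_graph(list(excites), undet))
```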


For example, consider a circuit with 5 gates (10 stuck-at faults) and a stuck-at test set consisting of 4 vectors V1, V2, V3, and V4. The excitation and detection dictionary obtained by simulation without fault dropping is shown in Table IV.

Table IV. Fault Dictionary Without Fault-Dropping

Vector   Excited Faults                        Detected Faults
V1       a-s-0, b-s-1, c-s-1, d-s-0, e-s-0     a-s-0, b-s-1
V2       b-s-1, c-s-0, d-s-0, e-s-1            c-s-0, d-s-0, e-s-1
V3       a-s-1, c-s-1                          a-s-1, c-s-1
V4       a-s-1, b-s-0, d-s-1, e-s-0            b-s-0, d-s-1, e-s-0

Assuming the test set order to be {V1, V2, V3, V4}, only 3 transition faults (slow-to-fall at c and e, and slow-to-rise at c) are detected. However, using Table IV, we can make the following observations by combining nonconsecutive vectors: (V1, V3) detects a slow-to-fall; (V3, V1) detects a slow-to-rise; (V1, V4) detects d slow-to-fall; (V4, V2) detects d slow-to-rise; (V4, V1) detects a slow-to-rise and b slow-to-fall; and (V2, V4) detects b slow-to-rise, e slow-to-rise, and d slow-to-fall. This results in the transition-pattern graph of Figure 4. Unlike general graphs, this weighted transition graph has a specific property, stated in the following theorem.

THEOREM 1. Faults detected by pattern (Vi, Vj) and faults detected by pattern (Vj, Vk) are mutually exclusive.

PROOF. We prove this by contradiction. Without loss of generality, consider fault f, slow-to-fall, detected by (Vi, Vj). Thus, Vi excites f s-a-0 (sets line f to 1), and Vj detects f s-a-1. Assume (Vj, Vk) also detects f, slow-to-fall. Then, the initial vector Vj must set line f to 1, which is a contradiction.

An Euler trail in the transition-pattern graph traverses all the edges in the graph exactly once. By inserting a minimal number of edges into the graph, one might be tempted to think that an Euler trail that traverses all edges yields the best test chain. However, this actually only leads to a suboptimal solution. Traversing edge (Vi, Vj) is equivalent to selecting test (Vi, Vj). Once edge (Vi, Vj) is traversed, that is, test (Vi, Vj) is selected, it detects a number of transition faults. This alters the weights on other edges and even removes some of the edges. Per Theorem 1, edges whose weights do not change are those originating from Vj. To improve the solution, the edge weights should be updated after traversing each edge. A preliminary version of the algorithm is outlined in Figure 3, where P is the transition test chain computed by the algorithm from the given stuck-at test set T. The idea behind this algorithm is as follows. We are looking for the smallest test sequence that can cover all the detectable faults by traversing the weighted transition-pattern graph. In a traditional weighted graph, traversing any edge does not affect the weight on other edges; therefore, an Euler trail, which traverses each edge exactly once, would be the optimum solution. But in our case, traversing any edge in the weighted transition-pattern graph may result in the detection and removal of a number of transition faults from the fault dictionary. Thus, the weights on other edges, which also detect these faults, will be altered and must be updated by simulation. For example, in the original weighted transition-pattern graph on the left-hand side of Figure 4, (V2, V4, V1) is the heaviest-weight test chain of length 3.


Fig. 3. Generic weighted transition graph algorithm.

Fig. 4. Weighted transition-pattern graph example.

After traversing this chain, 5 transition faults (a slow-to-rise, b slow-to-rise, b slow-to-fall, d slow-to-fall, and e slow-to-rise) have been detected. The updated graph is shown on the right-hand side of Figure 4. It should be noted that, in addition to removing the edges (V2, V4) and (V4, V1), the other two edges ((V1, V4) and (V3, V1)) are also removed from the graph. This is because the fault (d slow-to-fall), which can be detected by (V1, V4), has been detected by selecting the test chain (V2, V4, V1). Therefore, the edge (V1, V4) can be removed from the weighted pattern graph because it has no contribution to the future selection of edges. A similar argument applies to the edge (V3, V1) as well. Finally, all 7 undetected faults in Table IV are detected with the test chain {V2, V4, V1, V3, V4, V2}. Several optimizations were applied to the generic version of the algorithm to improve the results. First, we investigated the impact of different test chain lengths on the number of faults detected. Instead of considering one edge at a time, we use Theorem 1 to inspect path segments of length 3. This reduces the amount of simulation required, and a transition test chain is generated by incrementally concatenating the vector chains of length 3. While this solution is better than simply concatenating vector pairs for each of the remaining undetected transition faults, it still may not be optimal. When the test chain length increases beyond 3, the difference between the actual number of transition faults detected by the chain and the sum of edge weights in the chain can increase. This difference determines whether it is worthwhile to continue extending the chains. Thus, using longer chains reduces the number of graph updates (hence reducing the run-time of the algorithm), but it will increase the


size of the final solution. Our experiments suggest that chain lengths of 3 or 4 give the best results. Secondly, expanding on the essential test definition for stuck-at faults from Hamzaoglu and Patel [1999], we define an essential vector for transition faults to be any test vector that excites (or detects) at least one transition fault that is not excited (or detected) by any other test vector in the test set. All essential vectors must occur in the transition test chain at least once. We include essential vectors early in the chaining process. The transition test chain generation process is divided into two phases by constructing two weighted transition-pattern graphs.

(1) Identify all essential vectors, generate the transition-pattern graph using only essential vectors, and construct the test chains with only the essential vectors in the graph. Append the result to the initial transition test chain P in the generic version of the weighted transition graph algorithm.
(2) Reduce the number of faults by dropping faults detected in the first step. Generate the transition-pattern graph for the remaining faults and extend the partial transition test chain from the previous step as described in Step 3 of the generic weighted transition graph algorithm.

The number of edges in the second step of the modified algorithm is significantly reduced because most of the edges incident on the essential vectors have been traversed in the previous step and thus removed. During the test pattern generation procedure, some of the faults detected by the earlier test vectors may also be detected by the test patterns generated later. Therefore, vectors added early in our transition test set might become redundant. To identify such redundant patterns, we perform a reverse-order pair-wise compaction. After (U1, U2, U3), (U4, U5, U6), . . . , (Un−2, Un−1, Un) are generated, they are appended to the original stuck-at test set in that order. The test patterns (Un−1, Un), (Un−2, Un−1), . . . , (U1, U2) are simulated in the reverse order from which they were generated. If neither (Ui−1, Ui) nor (Ui, Ui+1) detects any additional fault, then Ui is redundant and can be eliminated. Note that eliminating a vector Ui will give rise to a new transition test pattern (Ui−1, Ui+1); therefore, we follow this by performing a forward-order pair-wise compaction step to further reduce the size of the test sequence.
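The reverse-order pass can be sketched directly from this description. The fault-simulation call is abstracted behind a placeholder predicate, and the pair expansion of a chain is included for completeness; both names are hypothetical.

```python
# Reverse-order pair-wise compaction sketch. detects_new_fault(pair)
# stands in for the transition fault simulator and is a placeholder.

def chain_to_patterns(chain):
    # Consecutive vectors in a transition test chain form the test pairs.
    return list(zip(chain, chain[1:]))

def reverse_compact(chain, detects_new_fault):
    kept = list(chain)
    i = len(kept) - 2                   # never touch the last vector...
    while i > 0:                        # ...or the first one
        if (not detects_new_fault((kept[i - 1], kept[i])) and
                not detects_new_fault((kept[i], kept[i + 1]))):
            del kept[i]                 # (U_{i-1}, U_{i+1}) becomes a new pair
        i -= 1
    return kept

# Usage with a trivial predicate that keeps every vector:
print(reverse_compact(["U1", "U2", "U3", "U4"], lambda pair: True))
```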



4. EXCHANGE SCAN

We saw that by using transition test chains and ATE repeat, we can reduce the test data storage. Data presented in Section 6 will show the average reduction to be about 46.5%. However, test application time can sometimes increase due to repeating on every vector in the chain. To reduce test application time while still retaining the improvement in data storage, we propose a new DFT technique. Instead of using the ATE repeat option to reuse the vector, a low-overhead alternative is possible in which each vector is only scanned in once. Reducing scan-in operations reduces test application time. The net result is a reduction in both the data storage requirement and the test application time.

Fig. 5. Block diagram of a hold-scan system.

Fig. 6. Hold scan cell and exchange scan timing.

The block diagram of a conventional hold-scan system is shown in Figure 5. The scan cells, each of which consists of two parts, a System Latch and a Shadow MSFF, are chained together to form two related registers: the SYSTEM REGISTER and the SHADOW REGISTER. During normal operation, the SYSTEM REGISTER is in use and the SHADOW REGISTER does not play any role. For scan testing, three scan operations—SCAN SHIFT, SCAN LOAD, and SCAN STORE—are supported. Assume the scan cell implementation of Figure 6. For SCAN SHIFT, the A CLK and B CLK are pulsed so that data passes from SI to SOUT. For SCAN STORE, the STORE signal is pulsed to transfer the data from SOUT to Q; the content of the SHADOW REGISTER is transferred to the SYSTEM REGISTER. In SCAN LOAD, the contents of the SYSTEM REGISTER are transferred to the SHADOW REGISTER: pulsing LOAD transfers data from Q to AOUT, and pulsing B CLK transfers the data from AL to BL. The new operation, SCAN EXCHANGE, exchanges the data between the SHADOW and the SYSTEM registers without destroying either of them. Pulsing LOAD transfers the contents of SL to AL. Pulsing STORE transfers the contents of BL to SL. Pulsing B CLK to transfer the content of AL to BL follows this. The corresponding timing diagram is shown in Figure 6(b). No additional hardware or signal is needed to support the exchange operation. It may require the global scan controller to be modified slightly to realize the exchange operation. The exchange operation takes about three clock cycles.
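The application-time benefit can be estimated with a rough cycle count. Without exchange scan, every interior vector of a chain is scanned in twice; with exchange scan, each vector is scanned in once and every rescan is replaced by the 3-cycle SCAN EXCHANGE. The sketch below ignores apply/capture cycles and uses hypothetical chain and scan-chain lengths:

```python
# Rough test-application-time model for a transition test chain of k
# vectors (k-1 patterns) with scan-chain length s.

def cycles_without_exchange(k, s):
    # Interior vectors are scanned in twice: 2k - 2 scan-shift operations.
    return (2 * k - 2) * s

def cycles_with_exchange(k, s, exchange_cycles=3):
    # Each vector is scanned in once; each rescan becomes a SCAN EXCHANGE.
    return k * s + (k - 1) * exchange_cycles

k, s = 1000, 1000   # hypothetical values
print(cycles_without_exchange(k, s))  # 1998000
print(cycles_with_exchange(k, s))     # 1002997 -- roughly half the cycles
```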


Fig. 7. Scan operation with/without exchange.

SCAN EXCHANGE is used for the transition test chain {V1, V2, V3, V4} as follows. The test pairs applied are (V1, V2), (V2, V3), (V3, V4). The expected response on application of V2 is R2, of V3 is R3, and of V4 is R4. The sequence of operations, without exchange scan, is shown in the first column of Figure 7. Capture R2 means that the response of the CUT on application of V2 is latched into the SYSTEM REGISTER. Scan in V2, Scan Out R2 denotes a SCAN SHIFT, where V2 is scanned in while the response of the CUT from the previous pattern is scanned out and compared to R2. The second column of Figure 7 shows the sequence of operations using SCAN EXCHANGE. Once V2 is loaded into the SHADOW REGISTER, the subsequent store and capture operations do not destroy the contents of the SHADOW REGISTER. So, the SCAN LOAD operation that destroys the contents of the SHADOW REGISTER is replaced by the SCAN EXCHANGE operation. It exchanges the contents of the SHADOW REGISTER and the SYSTEM REGISTER. Thus, the captured response is transferred to the SHADOW REGISTER, and V2 is applied to the circuit under test as the initial vector for the next test pair. We can, therefore, skip the sequence of operations that rescans V2 and stores it. In addition, the SCAN SHIFT of the response from V2, that is R2, can now be merged with the SCAN SHIFT of the final vector of the



next pattern V3. The net effect of this is that we can replace an entire scan operation with a 3-cycle SCAN EXCHANGE operation. Considering that the SCAN SHIFT operation may take 1000 or more clock cycles, the overhead of the SCAN EXCHANGE operation is negligible and will be neglected in our calculations. Note that transition test chains, but not the ATE repeat capability in the testers, are required to realize the gains of the exchange scan operation. If the ATE repeat capability is available, each of the vectors that are stored can be compressed as discussed in Keller et al. [2001], and the benefits of transition test chains can be realized using exchange scan. Our experimental results show that both test data volume and test application time decrease by 46.5%, compared to a commercial tool.

5. CONSTRAINED ATPG TO MINIMIZE OVERTESTING

In this section, we present a novel algorithm to address the possible yield loss due to overtesting of functionally untestable faults in enhanced-scan testing. A transition fault is functionally untestable if either the launching of the transition, or the propagation of its effect, is impossible in the functional mode due to constraints imposed by the circuit. Such constraints include the requirement for an illegal/unreachable state in the test pattern, or a two-state combination in the enhanced-scan pattern that is functionally impossible. Because the enhanced-scan model assumes total independence between the two vectors in the test pattern, some functionally untestable transition faults may become detected. Figure 8 illustrates the scenario. Since an enhanced-scan pattern consists of (State 1, V1, State 2, V2), if either (1) State 1/State 2 is illegal, or (2) the State 1/State 2 pair is not a valid state transition combination, the transition pattern may detect some of the functionally untestable faults. To avoid detection of such functionally untestable faults, we must make sure that these two scenarios do not arise.

Fig. 8. Arbitrary starting states in enhanced scan.

We first describe a low-cost, implication-based, functionally-untestable fault identification method that does not involve any sequential ATPG. Then we try to minimize the overtesting of these functionally untestable faults in our graph-formulated ATPG algorithm described in Section 3. In general, identifying functionally untestable faults in sequential circuits is of the same complexity as sequential ATPG. In Hsiao [2002], a method for identifying untestable stuck-at faults in sequential circuits by maximizing local conflicting value assignments has been proposed. The technique first computes a large number of logic implications across multiple timeframes and stores them


in an implication graph. Then the algorithm identifies impossible combinations of value assignments locally around each gate in the circuit, and those redundant stuck-at faults requiring such impossibilities. For identifying functionally-untestable transition faults, in addition to searching for the impossibilities locally around each gate, we also check the excitability of the initial value in the previous timeframe. In other words, if opposing values on a signal in two consecutive timeframes are not possible, we know that no transition can be launched on that signal in the functional mode. Furthermore, if the corresponding s@ fault becomes unobservable due to the imposed constraints, the transition fault would also be unobservable. Employing this technique, we can identify a large set of untestable transition faults in the circuit.
The identified functionally-untestable transition faults are mapped onto the graph traversal problem. We build the weighted transition-pattern graph for the given stuck-at test set as before. Again, in the graph, each node represents a vector in the stuck-at test set, and a directed edge from Vi to Vj denotes the transition test pattern (Vi, Vj). To address the problem of overtesting functionally-untestable faults, each edge in the weighted transition-pattern graph has two weights: W1 indicates the number of functionally-untestable faults detected by test pattern (Vi, Vj), and W2 represents the number of functionally-testable faults detected by the pattern. Therefore, our target is to achieve the highest transition fault coverage, while minimizing the overtesting of functionally-untestable faults. The heuristic we used to minimize the overtesting is presented next. Assume the stuck-at test set T = T1 . . . TN is given.

(1) Identify the functionally-untestable fault set R using the transition implication tool.
(2) Perform transition fault simulation using the stuck-at test set as a transition test set {(T1, T2), (T2, T3), . . . , (TN−1, TN)} to compute undet, the set of undetected transition faults.
(3) Deduce the subset U of the stuck-at faults implied by undet as follows: if an X slow-to-rise or slow-to-fall fault ∈ undet, then both X stuck-at-0 and X stuck-at-1 are included in U.
(4) Categorize the faults in U into two parts: U1 = U ∩ R, standing for the functionally-untestable fault set in U, and U2 = U − U1, standing for the functionally-testable fault set in U.
(5) Build the fault dictionary for U1 and U2, respectively, using fault simulation without fault-dropping, and generate the weighted pattern graph with W1 and W2 on each edge.
(6) Compute the weight W on each edge, using the formula W = F(W1, W2, threshold). For small circuits, we set the threshold to zero; for bigger circuits, we predetermined the threshold as a small fraction of the number of functionally-untestable faults in the circuit.
(7) Greedily construct transition test chains with the maximum W and append the transition test chains to the original stuck-at test set T.


Fig. 9. Weighted pattern graph with two weights.

Function F is defined as follows:

    F = \begin{cases} W_2 - W_1 & \text{if } W_1 \le \text{threshold} \\ 0 & \text{otherwise} \end{cases}
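To make the weighting and greedy chain construction concrete, the following is a minimal Python sketch of steps (6) and (7), assuming the two edge weights have already been obtained from the fault dictionaries. All names are illustrative; the implementation reported in Section 6 is written in C.

```python
# Sketch of constrained chain construction; edges maps (vi, vj) -> (W1, W2),
# where W1 counts functionally-untestable and W2 functionally-testable
# faults detected by the transition test pattern (vi, vj).

def edge_weight(w1, w2, threshold):
    """W = F(W1, W2, threshold): edges whose overtesting count W1
    exceeds the threshold contribute nothing."""
    return w2 - w1 if w1 <= threshold else 0

def build_chain(vertices, edges, threshold, max_len):
    """Greedily grow one transition test chain along maximum-W edges."""
    start = max(edges, key=lambda e: edge_weight(*edges[e], threshold))
    chain, used = list(start), {start}
    while len(chain) < max_len:
        candidates = [(edge_weight(*edges[(chain[-1], v)], threshold), v)
                      for v in vertices
                      if (chain[-1], v) in edges and (chain[-1], v) not in used]
        if not candidates:
            break
        w, v = max(candidates)
        if w <= 0:                      # no profitable extension remains
            break
        used.add((chain[-1], v))
        chain.append(v)
    return chain
```

With threshold = 0, the construction avoids all nonzero-W1 edges, which corresponds to the strictest setting discussed in the example that follows.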

Let us look at the example in Section 3 again. Suppose the transition fault a slow-to-rise is a functionally-untestable fault; then the weighted pattern graph in Figure 4 is modified to Figure 9. Again, recall that every vertex in the figure is a test vector, and there exists an edge between two vertices if they detect at least one fault. Different from the previous example, every edge now has two weights, namely W1 and W2, which are the number of functionally-untestable faults and functionally-testable faults detected by the pair of test vectors, respectively. So, from the fault dictionary, we can see that (V4, V1) detects one functionally-testable fault (b slow-to-fall) and one functionally-untestable fault (a slow-to-rise). Therefore, the weights on edge (V4, V1) are W1 = 1 and W2 = 1. Similarly, (V3, V1) detects only a slow-to-rise, which is a functionally-untestable fault; correspondingly, the weights on edge (V3, V1) are W1 = 1 and W2 = 0. Therefore, selecting a transition test chain no longer depends merely on the total number of transition faults it can potentially detect. We must distinguish the number of functionally-testable faults and the number of functionally-untestable faults each chain can detect. For instance, consider the transition test chain {V2, V4, V1, V3, V4, V2} generated in Section 3. Although all the functionally-testable faults will be detected, the functionally-untestable fault (a slow-to-rise) will also be detected. To avoid overtesting this functionally-untestable fault, we generate the transition test chain by avoiding traversal of the edges with nonzero W1. In doing so, the new transition test chain {V2, V4, V2, V1, V3} (which avoids traversing nonzero-W1 edges) will now detect only 5 of the 6 remaining functionally-testable faults, because all the vectors that can detect b slow-to-fall also detect the functionally-untestable fault a slow-to-rise. Consequently, a slight drop in fault coverage of the functionally-testable faults may result. We can also relax this condition of avoiding all nonzero-W1 edges to increase coverage of functionally-testable faults. In essence, we set a threshold on W1 such that we will still consider some nonzero-W1 edges, but we make sure that detection of such untestable faults is bounded by the threshold value.




Table V. Results with Different Chain Lengths

                    Chain length
          2          3          4          5          6          7
Circuit   ID    AC   ID    AC   ID    AC   ID    AC   ID    AC   ID    AC
s344      38    38   64    64   90    72   111   89   143   89   164   103
s382      24    24   43    43   63    44   75    56   92    61   111   61
s832      60    60   87    87   114   89   136   95   158   95   180   103
s1196     47    47   78    78   116   93   146   105  177   105  208   119
s1423     36    36   60    60   79    79   106   95   129   118  153   124
s5378     59    59   103   103  149   142  190   144  235   172  284   176
s35932    1072  1072 1175  1175 1270  1247 1382  1261 1475  1312 1564  1334
s38417    132   132  190   190  274   249  380   285  439   296  500   319

Table VI. Results with/without Essential Vectors

                         Without essential vectors       With essential vectors
Circuit  S@ Set  Tran.   Comp.   Time     TFC    Tran.   Comp.   Time     TFC
                 Tests   Tests   (s)      (%)    Tests   Tests   (s)      (%)
c1355    198     928     285     3.51     99.77  915     270     2.82     99.77
c1908    143     918     318     4.57     99.67  966     298     3.32     99.67
c3540    202     1222    515     25.43    96.27  1181    514     22.06    96.27
c5315    157     816     342     11.80    99.54  762     313     9.79     99.54
c6288    141     310     122     5.60     99.19  334     120     5.70     99.19
s344     31      135     63      0.37     100    207     64      0.37     100
s832     179     988     310     4.36     99.20  937     292     2.78     99.20
s1196    197     1004    362     5.24     99.97  1022    358     4.38     99.97
s1423    97      566     186     2.25     99.11  528     177     2.09     99.11
s5378    332     1672    722     35.73    98.40  1685    722     29.76    98.40
s35932   78      542     196     133.13   90.50  633     197     133.01   90.50
s38417   1207    5142    2682    1073.03  99.66  5208    2686    858.85   99.66

6. EXPERIMENTAL RESULTS

The weighted transition graph algorithm, with all the optimizations described above, was implemented in C. Experimental data are presented for ISCAS85 and full-scan versions of ISCAS89 benchmarks, on a 1.7GHz Pentium 4 with 512MB of memory, running the Linux operating system. In Table V, results for the full-scan versions of ISCAS89 circuits with different chain lengths are presented. For each benchmark, the ideal (ID) faults detected by the chains, given by the sum of the edge weights, and the actual (AC) faults detected are shown. The difference between the ideal (ID) and actual (AC) increases with the chain length. For example, for circuit s5378, when the chain length is 2, the ideal and actual numbers of detected transition faults are the same. Likewise, when the chain length is increased to 3, they are still equal, as explained by the proposed theorem. When we increase the chain length beyond 3, the actual number of detected transition faults starts to differ, as some of the faults detected by the last chain segment may already be detected by the first 2 pairs.
Table VI presents the results of the weighted transition graph algorithm with and without using essential vectors. In Table VI, column 2 gives the number of stuck-at vectors in the original STRATEGATE test set, followed by the results for our algorithm without using essential vectors. The final four columns show the results when essential vectors are used. For each approach, the number of

Table VII. ATE Repeat vs. COM

         Storage                     TF COV (%)     CPU Time (s)               Improvements (%)
         COM           WT GR         COM    WT GR   COM    WT GR               Str.    Time
CKT      s@     Tran   3      4                            3         4
c1908    177    526    353    346    99.7   99.72   4.2    3.33      3.47      32.89   −34.22
c2670    167    396    185    184    78.6   79.26   4.3    9.25      9.99      53.28   6.57
c3540    247    732    410    419    82.9   87.62   6.8    19.45     20.11     43.99   −12.02
c5315    213    496    310    323    96.6   97.05   4.8    10.20     10.77     37.50   −25.00
c6288    47     180    130    118    99     98.54   3.6    3.72      3.62      27.77   −44.44
c7552    348    756    428    431    91     91.61   11     28.34     28.21     43.38   −13.23
s5378    391    980    455    464    86.6   87.51   5.3    37.66     37.76     53.57   7.14
s9234    630    1674   644    646    68.6   70.58   14.7   319.18    324.20    61.53   23.06
s13207   662    2004   681    679    80.5   82.29   27.4   292.89    295.30    66.02   32.06
s15850   641    1848   729    749    85     85.76   28     350.00    358.58    60.55   21.10
s35932   81     274    224    209    90     90.33   94.3   87.56     86.30     18.29   −63.50
s38417   1449   3854   1555   1547   89.9   91.19   116    1416.82   1406.83   59.65   19.30

transition vectors produced is shown first. Next, the number of compacted test vectors and the transition fault coverage are shown. Note that the compaction step achieves considerable reduction without losing fault coverage. The transition fault coverage achieved with or without essential vectors is the same, as indicated in columns 6 and 10 of the table; only the test set sizes differ between the two approaches. In most cases, the use of essential vectors yields smaller test sets. However, because this is a greedy heuristic, optimality is not guaranteed. The execution time with essential vectors is also generally shorter due to the quick elimination of a large number of faults detected by the essential vectors. The extra computation needed for s38417 was due to the lack of trimming in the weighted transition-pattern graph. Because our target was to show the proof of concept that stuck-at vectors can be chained through the proposed WTG, we did not explicitly target reduction in the execution time.

6.1 Experimental Results for ATE Repeat

In Table VII, we compare our results with results using COM, a commercial ATPG tool. We first tabulate the data for the storage required (STORAGE). Both the size of the stuck-at test set and the number of transition test vectors for COM are presented. These were generated using the dynamic compaction option of COM. Thus, for c1908 we need to store a total of 526 vectors. WT GR used the stuck-at test set generated using COM without compaction. The next two columns show the transition test chain lengths obtained using the proposed algorithm, with chain lengths 3 and 4, respectively. Thus, for c1908, we need to store 353 or 346 vectors, depending on the chain length used in the algorithm. The storage improvement obtained using transition test chains and the ATE repeat option is shown in column 11, where a chain length of 3 was assumed. Thus, for c1908, the storage improvement is calculated as 100 × (526 − 353)/526. Note the substantial reduction in all cases. The average reduction in scan memory requirement is 46.5%. Columns 6 and 7 compare the transition fault coverage obtained by the weighted transition graph (WT GR) and COM. Note that there is no loss in fault coverage


Table VIII. ATE Weighted Transition-Pattern Graph Algorithm vs. COM

         Storage               Improvement
Circuit  COM     WT GR         Data, AppTime (%)
c1908    526     353           32.89
c2670    396     185           53.28
c3540    732     410           43.99
c5315    496     310           37.50
c6288    180     130           27.77
c7552    756     428           43.38
s5378    980     455           53.57
s9234    1674    644           61.53
s13207   2004    681           66.02
s15850   1848    729           60.55
s35932   274     224           18.29
s38417   3854    1555          59.65

using WT GR. Columns 8, 9, and 10 compare the CPU time required by COM and the two versions of our algorithm. For most of the circuits, COM is much faster. The last column of the table shows changes in the test application time. Recall that, for a given transition test chain, all but the first and last vectors are scanned in twice. Therefore, the test application time gain for c1908 is computed as 100 × (526 − 2 × 353)/526 = −34.22. This implies a 34.22% increase in test application time if transition test chains with ATE repeat are used. However, in a number of cases, the test application time actually decreases by a significant amount. The average increase in test application time is 6.9%. Next, we present the results for reducing the extra test application time with exchange scan.

6.2 Experimental Results for Exchange Scan

Benefits of using exchange scan versus COM are reported in Table VIII. Transition test chains were computed using our heuristic with the chain length set to 3. The improvements in both test application time and test data volume are reported in column 4. Thus, for c1908, the test application time reduction is calculated as 100 × (526 − 353)/526 = 32.89%. We note that now there is a substantial reduction in both the scan memory requirement and the test application time, compared to COM. The average reduction in both test application time and data storage requirement is 46.5%. Results from Tables VII and VIII are illustrated graphically in Figure 10. For each circuit, the data storage requirement and test application time are plotted for the conventional ATE, ATE repeat, and exchange scan.

6.3 Experimental Results for Constrained ATPG

In Table IX, we compare the results of the weighted transition graph algorithm with and without considering the constraints on functionally-untestable faults. The number of functionally-untestable faults identified by our implication engine is presented in column 2, followed by the percentage of


Fig. 10. Graphical experimental results.

functionally-untestable faults for each circuit. The next three columns show the transition fault coverage achieved by the STRATEGATE stuck-at vectors: the total transition fault coverage, the functionally-testable fault coverage, and the overtesting fault ratio, respectively. Next, the results without considering the functionally-untestable faults are shown. The final three columns tabulate the results on the transition fault coverage while considering the constraint on functionally-untestable faults. For instance, for circuit s5378, our low-cost transition implication engine identified 3695 out of the total 15322 faults as functionally untestable. So the percentage of functionally-untestable faults is 3695/15322 = 24.12%. In other words, only 75.88% (100% − 24.12%) of the faults can be detected in the functional mode. Looking at the STRATEGATE stuck-at test set for s5378, we found that it can detect 92.96% of the transition faults. While most of the


Table IX. Results With/Without Constraint

                           Orig. S@ vectors       Without constraint      Constrained ATPG
Circuit  Red     Red       FC     FFC    OVT      FC     FFC    OVT      TFC    FFC    OVT
         fault   ratio(%)  (%)    (%)    (%)      (%)    (%)    (%)      (%)    (%)    (%)
s344     66      6.53      88.02  82.57  5.45     98.51  92.07  6.44     93.37  91.79  1.58
s832     96      4.26      72.45  69.92  2.53     94.29  90.79  3.50     91.04  90.51  0.53
s1196    5       0.12      86.17  86.05  0.12     94.74  94.62  0.12     91.28  91.28  0
s1423    387     9.35      94.40  86.50  7.90     98.14  89.88  8.26     88.57  83.29  4.69
s5378    3695    24.12     92.96  75.08  17.89    94.16  75.87  18.29    84.30  74.86  3.75
s35932   11255   11.21     86.63  85.25  1.38     89.61  87.90  1.71     89.12  87.42  1.70
s38417   32086   27.03     97.36  72.14  25.22    97.63  72.38  25.25    87.91  69.34  18.57

functionally-testable faults can be detected (75.08% out of the total 75.88%), the overtesting ratio for functionally-untestable faults is 17.89%. If we ignore the overtesting factor and only target the functionally-testable faults, our weighted transition graph algorithm can improve the total transition fault coverage to 94.16%, but at the cost of an overtesting ratio of 18.29%. Finally, if we impose the constraint of minimizing the overtesting of functionally-untestable faults in our graph algorithm, we can reduce the overtesting ratio to only 3.75% and still capture most of the functionally-testable faults. Only 75.88% − 74.86% = 1.02% of the functionally-testable faults are missed. Note that for this constrained ATPG, we do not include the original sequence of s@ vectors in our final test set, since the original order of vectors can potentially detect many functionally-untestable faults. Results for other benchmark circuits can be explained in a similar manner.

7. CONCLUSION

We have presented efficient techniques to reduce test data volume and test application time for transition faults. First, we proposed a novel transition test chain formulation via a weighted transition-pattern graph. Only s@ ATPG is needed to construct the necessary test chains for transition faults. By combining the proposed transition test chains and the ATE repeat capability, we reduce the test data volume by 46.5%, compared to the conventional approach. The second technique, which replaces the ATE repeat option with exchange scan, improves both the test data volume and the test application time by 46.5%. In addition, we address the problem of yield loss due to incidental overtesting of functionally-untestable transition faults. By formulating it as a constraint in our weighted pattern graph, we can efficiently reduce the overtesting ratio. The average reduction in the overtesting ratio is 4.68%, with a maximum reduction of 14.5%.

REFERENCES

CHANDRA, A. AND CHAKRABARTY, K. 2001. Frequency-directed run-length (FDR) codes with application to system-on-a-chip test data compression. In Proceedings of the VLSI Testing Symposium. 42–47.
DERVISOGLU, B. AND STONG, G. 1991. Design for testability: Using scanpath techniques for path-delay test and measurement. In Proceedings of the IEEE International Test Conference. 365–374.
DAS, D. AND TOUBA, N. A. 2000. Reducing test data volume using external/BIST hybrid test patterns. In Proceedings of the IEEE International Test Conference. 115–122.


ELDRED, R. D. 1959. Test routines based on symbolic logical statements. J. ACM 6, 1 (Jan.), 33–36.
HSU, F. F., BUTLER, K. M., AND PATEL, J. H. 2001. A case study of the Illinois scan architecture. In Proceedings of the IEEE International Test Conference. 538–547.
HAMZAOGLU, I. AND PATEL, J. H. 1999. Reducing test application time for full scan embedded cores. In 29th International Symposium on Fault-Tolerant Computing. 260–267.
HERAGU, K., PATEL, J. H., AND AGRAWAL, V. D. 1996. Segment delay faults: A new fault model. In Proceedings of the VLSI Testing Symposium. 32–39.
HSIAO, M. S. 2002. Maximizing impossibilities for untestable fault identification. In IEEE Design Automation and Test in Europe Conference. 949–953.
KELLER, B., BARNHART, C., BRUNKHORST, V., DISTLER, F., FERKO, A., FARNSWORTH, O., AND KOENEMAN, B. 2001. OPMISR: The foundation of compressed ATPG vectors. In Proceedings of the IEEE International Test Conference. 748–757.
KOENEMAN, B. 1991. LFSR-coded test patterns for scan designs. In IEEE European Test Conference. 237–242.
LEE, K.-J., CHEN, J.-J., AND HUANG, C.-H. 1998. Using a single input to support multiple scan chains. In IEEE/ACM International Conference on Computer-Aided Design. 74–78.
LIU, X., HSIAO, M. S., CHAKRAVARTY, S., AND THADIKARAN, P. J. 2003. Efficient transition fault ATPG algorithms based on stuck-at test vectors. J. Electron. Test. Theory Appl. 19, 4 (Aug.), 437–445.
LAI, W. C., KRISTIC, A., AND CHENG, K. T. 2000a. On testing the path delay faults of a microprocessor using its instruction set. In Proceedings of the VLSI Testing Symposium. 15–20.
LAI, W. C., KRISTIC, A., AND CHENG, K. T. 2000b. Test program synthesis for path delay faults in microprocessor cores. In Proceedings of the IEEE International Test Conference. 1080–1089.
REARICK, J. 2001. Too much delay fault coverage is a bad thing. In Proceedings of the IEEE International Test Conference. 624–633.
SMITH, G. L. 1985. Model for delay faults based upon paths. In Proceedings of the IEEE International Test Conference. 342–349.
SAVIR, J. AND PATIL, S. 1993. Scan-based transition test. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 12, 8 (Aug.).
SAVIR, J. AND PATIL, S. 1994. On broad-side delay test. In Proceedings of the VLSI Testing Symposium. 284–290.
TENDULKAR, N., RAINA, R., WOLTENBURG, R., LIN, X., SWANSON, B., AND ALDRICH, G. 2002. Novel techniques for achieving high at-speed transition fault coverage for Motorola's microprocessors based on PowerPC instruction set architecture. In Proceedings of the VLSI Testing Symposium. 3–8.
WAICUKAUSKI, J. A., LINDBLOOM, E., ROSEN, B. K., AND IYENGAR, V. S. 1987. Transition fault simulation. IEEE Des. Test Comput. (April), 32–38.

Received March 2003; revised August 2003, January 2003; accepted January 2004


A Detailed Power Model for Field-Programmable Gate Arrays

KARA K. W. POON, STEVEN J. E. WILTON, and ANDY YAN
University of British Columbia

Power has become a critical issue for field-programmable gate array (FPGA) vendors. Understanding the power dissipation within FPGAs is the first step in developing power-efficient architectures and computer-aided design (CAD) tools for FPGAs. This article describes a detailed and flexible power model which has been integrated in the widely used Versatile Place and Route (VPR) CAD tool. This power model estimates the dynamic, short-circuit, and leakage power consumed by FPGAs. It is the first flexible power model developed to evaluate architectural tradeoffs and the efficiency of power-aware CAD tools for a variety of FPGA architectures, and is freely available for noncommercial use. The model is flexible, in that it can estimate the power for a wide variety of FPGA architectures, and it is fast, in that it does not require extensive simulation, meaning it can be used to explore a large architectural space. We show how the model can be used to investigate the impact of various architectural parameters on the energy consumed by the FPGA, focusing on the segment length, switch block topology, lookup-table size, and cluster size.

Categories and Subject Descriptors: B.7.2 [Integrated Circuits]: Design Aids
General Terms: Design, Experimentation, Algorithms
Additional Key Words and Phrases: Power estimation model, architecture, power consumption, sensitivity analysis

This research was supported by Altera, Micronet, and the Natural Sciences and Engineering Research Council of Canada. Preliminary versions of parts of this article appeared in the Proceedings of the Conference on Field-Programmable Logic and Applications (2002) and the Proceedings of the IEEE International Conference on Field-Programmable Technology (2002).
Authors' addresses: Department of Electrical and Computer Engineering, The University of British Columbia, 2356 Main Mall, Vancouver, BC, Canada, V6T 1Z4; email: {karap,stevew,ayan}@ece.ubc.ca.

1. INTRODUCTION

Power dissipation is becoming a major concern for semiconductor vendors and customers. Power is especially a concern in field-programmable gate arrays (FPGAs). The postfabrication flexibility in these devices is provided using a large number of prefabricated routing tracks and programmable switches. These tracks can be long, and can consume a significant amount of energy every time they switch. In addition, the programmable switches add capacitance


to each track; this further increases the power dissipation of FPGAs. Finally, the generic logic structures that are at the heart of every FPGA consume more power than the dedicated circuitry that would be found on an ASIC. For all these reasons, FPGA vendors have indicated that power is one of the primary concerns of their customers.
There has been a modest amount of work developing low-power FPGA architectures and FPGA CAD algorithms that optimize for low power [George et al. 1999; Hwang et al. 1998; Kumar and Ravikumar 2002; Lesea and Alexander 2001; Rabaey 1996; Tuan et al. 2001]. Each of these previous studies, however, has presented "point solutions" for specific FPGA architectures or specific FPGA CAD programs. In addition, these works have tended to use fairly crude models to estimate the power savings, and often did not take into account many important design details that may negate any advantages claimed by the proposed techniques.
Our long-term goal is to understand and investigate the effects of various architectural and CAD tool optimizations on the power and energy consumed by FPGAs. As a first step in this effort, we have developed a detailed power model for FPGAs based on the Versatile Place and Route (VPR) CAD tool. This power model is flexible, in that it can be used to estimate power in a wide variety of FPGA architectures. It is fast, in that estimates can be obtained without the time-consuming computation of programs such as SPICE, or the reliance on simulations, as in Li et al. [2003]. Also, the model gives good fidelity; although there may be significant absolute errors in the power estimation, the power model is capable of evaluating architectural tradeoffs and the efficiency of power-aware CAD tools based on the relative comparisons among alternative architectures or algorithms. In addition to providing comparisons between architectural alternatives or CAD algorithms, the power model has also been used as an integral part of a power-aware CAD flow, in which energy dissipation is optimized at every stage from logic synthesis to physical design [Lamoureux and Wilton 2003]. We have used this CAD flow in our experiments to investigate the influence of architectural changes on energy consumption.
This article is organized as follows. Section 2 introduces the framework of the flexible power model and how it is incorporated into the VPR CAD tool. Section 3 describes the power model. Section 4 presents an analysis of how architectural changes impact energy consumption and provides a sensitivity analysis focusing on the primary input density assumption and the routing algorithm. Finally, our conclusions are given in Section 6. The model is available freely for noncommercial use; the Appendix describes how to obtain the model.

2. VPR FRAMEWORK

The VPR CAD tool is a widely used placement and routing tool available for FPGA architectural studies [Betz et al. 1999]. As shown in Figure 1, the original VPR has two components: a place and route tool, and a detailed area and delay model. The place and route tool maps a circuit to the FPGA. The area and delay models estimate the area and critical path delay based on results from the place


Fig. 1. Original VPR framework.

Fig. 2. Modified VPR framework.

and route tool. The two components interact with each other to determine the best placement and routing for a user circuit. A description of the underlying FPGA architecture is provided to the tool in the form of an architecture file, which contains information such as segment length, connection topologies, logic block size and composition, and process parameters. The architecture file is an important feature in VPR; it allows any architecture to be specified, and hence makes the CAD tool highly flexible.
Figure 2 shows the modified VPR framework, with an activity estimator and a power model added for activity generation and detailed power estimation, respectively. In the baseline CAD flow, the activity estimator and the power model are not used to guide the placement and routing. In the power-aware CAD flow, it is possible to use the power estimates to guide the placement and routing process in order to improve the effectiveness of the power optimization techniques. The details of the activity generation and power estimation modules are described in Section 3.

3. POWER MODEL

Our power model is aimed at island-style FPGA architectures, which have logic blocks, switch blocks, connection blocks, and routing, as shown in Figure 3, with an H-tree clock network, as shown in Figure 4. The model has two modules: an activity generation module, and a power estimation module. The first module employs the transition density model to determine the switching activities inside the circuit. The second module estimates the power consumption at the


Fig. 3. Island-style FPGA (from Betz et al. [1999]).

Fig. 4. H-tree clock network.

transistor level. The model was calibrated using HSPICE with the technology parameters from TSMC for a 1.8-V, 0.18-µm CMOS technology. However, the model is general enough to apply to any technology.

3.1 Activity Generation

Probabilistic techniques are preferred for our activity generation step because of their efficiency in computation. Among all the available probabilistic techniques, the Transition Density Model is the most accurate [Yeap 1998]. Therefore, the Transition Density Model is employed in this power model. The Transition Density Model is based on two parameters for each signal: the transition density and the static probability. The transition density of a signal represents the average number of transitions of that signal per unit time, and the static probability is the probability of the signal being high at any given time. The transition density and static probability values of all the signals are calculated iteratively from the primary inputs to the primary outputs. The propagation of transition density through each lookup table (LUT) can


Fig. 5. Model of the low-pass filter for each LUT (from [Najm 1994b]).

be determined by Equation (1) [Najm 1994a]. As all LUT inputs are assumed to be uncorrelated with each other, each input contributes a static probability, P(∂f(x)/∂x_i), and a transition density, D(x_i), to the total density, D(y), at the output:

    D(y) = \sum_{\text{all inputs}} P\!\left(\frac{\partial f(x)}{\partial x_i}\right) \cdot D(x_i).    (1)

Even though the original Transition Density Model applies only to combinational circuits, it can be extended to sequential circuits. For D flip-flops, the output probability can be set to be the same as the input probability. The transition density of the output, D(y), of the flip-flop can be modeled as its transition probability, P_t(y), written as [Najm 1995a, 1995b]

    D(y) = P_t(y) = 2 \cdot P(x) \cdot (1 - P(x)),    (2)

where P(x) represents the probability that signal x is high. For each sequential feedback loop, a mutual probability is determined for the output of each D flip-flop through iterations. Even though this is an approximate method for calculating the transition probabilities of the signals in the feedback loops, a previous study has shown that the average error obtained by using this method for three iterations is less than 5% [Tsui et al. 1994].
Due to unequal gate and wire delays in a logic network, the voltage on internal signals may switch more than once during a single clock cycle, before stabilizing. These small pulses are often called glitches, and are an important component of the total overall power. However, the original Transition Density Model does not consider the fact that pulses shorter than the propagation delay of a gate are filtered out because the gate cannot respond fast enough [Najm 1994b]. To simulate the filtering effect of circuit inertial delays, a "low-pass filter" is modeled at the output of each gate (each LUT in an FPGA), as shown in Figure 5. A transition at y is transmitted across the filter only when the input remains stable over a certain period of time. A probability distribution function, with the pulse width as a parameter, is used to determine whether an input pulse is propagated to the output [Najm 1994b]. As shown in Figure 6, activity estimation is carried out in three steps: LUT organization, static probability calculation, and transition density calculation.
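To illustrate Equations (1) and (2), the following is a minimal Python sketch of the propagation step for a single LUT and a D flip-flop, assuming fully independent inputs and omitting the glitch filter. The names and data layout are illustrative, not the actual VPR implementation.

```python
from itertools import product

def lut_density(truth, p, d):
    """Equation (1) for a k-input LUT.
    truth: dict mapping each input bit-tuple to the LUT output (0/1);
    p[i], d[i]: static probability and transition density of input i."""
    k = len(p)
    dy = 0.0
    for i in range(k):
        # P(df/dx_i): probability that the output is sensitive to x_i
        prob = 0.0
        for rest in product((0, 1), repeat=k - 1):
            lo = rest[:i] + (0,) + rest[i:]
            hi = rest[:i] + (1,) + rest[i:]
            if truth[lo] != truth[hi]:
                pr = 1.0
                for j, bit in enumerate(rest):
                    pj = p[j if j < i else j + 1]   # index of the other input
                    pr *= pj if bit else (1.0 - pj)
                prob += pr
        dy += prob * d[i]
    return dy

def dff_density(p_in):
    """Equation (2): D(y) = 2 * P(x) * (1 - P(x)) for a D flip-flop."""
    return 2.0 * p_in * (1.0 - p_in)

# Example: 2-input AND LUT with both inputs at P = 0.5, D = 0.2
truth_and = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
print(lut_density(truth_and, [0.5, 0.5], [0.2, 0.2]))  # 0.5*0.2 + 0.5*0.2 = 0.2
```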


Fig. 6. Pseudocode of the transition density algorithm.

First, LUTs are ordered from the primary inputs to the primary outputs. For sequential circuits, the outputs of the flip-flops are initially assumed to be primary inputs. Then, the calculation of the static probability is carried out for each LUT output signal. In sequential circuits, several iterations may be required. Finally, the transition density calculation is performed. As part of the transition density calculation, the CAD tool determines whether glitches exist at the output of each LUT; the low-pass filter is applied at the output of the LUT to eliminate unrealistic activity values.

3.2 Dynamic Power Estimation

After the switching activities have been determined, the next step is to analyze the power dissipation at the transistor level for each component inside the FPGA. The average power consumption in digital circuits consists of three main components: dynamic, short-circuit, and leakage power [Kang and Leblebici 1999]. The estimation methodology for these three components will be described in this and the following two sections. The model for each component has been evaluated using HSPICE.
Dynamic power is the dominant component of the total power. It is dissipated every time a signal changes, due to the charging and discharging of load and parasitic capacitances. Therefore, dynamic power is closely related to the transition density of all nodes inside the circuit. The total dynamic power dissipation can be written as

    \text{Dynamic power} = \sum_{\text{all nodes}} 0.5 \cdot C_y \cdot V_{supply} \cdot V_{swing} \cdot D(y) \cdot f_{clk}.    (3)

The expression 0.5 · C_y · V_supply · V_swing · D(y) determines the energy per clock cycle, where V_swing is the swing voltage of each node, V_supply is the supply voltage, D(y) is the transition density at node y, and C_y is the capacitance of node y that is charged and discharged during each transition. The dynamic power is then equal to the energy per clock cycle multiplied by the clock frequency, f_clk, which is bounded by the critical path delay of the circuit.
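A minimal sketch of Equation (3) in Python follows; the node list and electrical values are illustrative placeholders, not values from the model itself.

```python
def dynamic_power(nodes, v_supply, v_swing, f_clk):
    """Equation (3): nodes is an iterable of (C_y in farads, D(y) in
    transitions per cycle); returns dynamic power in watts."""
    energy_per_cycle = sum(0.5 * c_y * v_supply * v_swing * d_y
                           for c_y, d_y in nodes)
    return energy_per_cycle * f_clk

# Example: two nodes of 10 fF and 25 fF at 1.8 V full-swing, 100 MHz
print(dynamic_power([(10e-15, 0.2), (25e-15, 0.5)], 1.8, 1.8, 100e6))
```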



Fig. 7. Example FPGA routing segment.

3.2.1 Routing Resource Dynamic Power. To estimate dynamic power, we separate the resources in an FPGA into three categories: routing resources, logic blocks, and the clock network. We estimate the power dissipated by resources in each category separately. This subsection focuses on the power dissipated in the routing fabric; the next two subsections focus on the logic blocks and the clock network.
A large part of the dynamic power is due to switching tracks within the routing fabric of the FPGA. As described in Section 3.1, the power dissipated in the fabric can be calculated using the transition density and capacitance of each track. Since we wish our power model to be flexible enough to model the power in any FPGA that can be described within VPR, and since the capacitances of the routing tracks vary greatly with the track length and the number of attached buffers, a single value for track capacitance will not suffice. Instead, we extract capacitance information from the routing resource graph within VPR for each metal track separately. Figure 7 shows an example metal track that spans four logic blocks and is attached to a number of programmable switches. In general, the capacitance of a track depends on the number of logic blocks spanned by the segment, the size of each logic block (since a larger logic block implies a longer metal track), the number of pins on each logic block, the switch block and connection block connectivities, and information about the target technology. The sizes of the buffers were estimated in Betz et al. [1999] to optimize the speed of the FPGA. Using this information, the overall capacitance of each track is estimated by adding the metal capacitance of the track itself and the parasitic capacitances of all switches attached to the track. More details on the routing resource graph and the capacitance calculation can be found in Betz et al. [1999].
After calculating the capacitance information for each track, the overall dynamic power of the routing fabric is calculated. For each net in the design, the capacitances of all tracks used to route the net are summed, and the activity of the net is then used, along with this capacitance, to calculate the power dissipated by that net. This is summarized in Figure 8.
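The per-net calculation summarized in Figure 8 can be sketched as follows, assuming the per-track capacitances have already been extracted from the routing resource graph; the data layout is illustrative.

```python
def routing_dynamic_power(nets, v_supply, v_swing, f_clk):
    """nets: iterable of (net activity D(y), list of capacitances of the
    tracks used to route the net). Sketch of the Figure 8 summary."""
    total = 0.0
    for activity, track_caps in nets:
        c_net = sum(track_caps)   # total capacitance charged by this net
        total += 0.5 * c_net * v_supply * v_swing * activity * f_clk
    return total
```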


Fig. 8. Algorithm for calculating dynamic power of the routing fabric.

Fig. 9. Comparison of our model and HSPICE for one routing segment.

To verify the model, an HSPICE model was created. Figure 9 shows the power predicted by our model along with the power predicted by HSPICE, for a range of segment lengths. The wires were switched at 20 MHz to ensure that the wires had fully charged or discharged during each cycle. As the graph shows, the model results match the HSPICE results very closely, with an average error of 4.8%.

3.2.2 Logic Block Dynamic Power. Like the power model for the routing fabric, the power model for the logic block must be flexible. It must accommodate any lookup table size, any number of lookup tables in each cluster, and any number of inputs to each cluster. The model assumes the architecture shown in Figure 10, and consists of four components: the power dissipated in the lookup tables, the power dissipated in the input multiplexers, the power dissipated in the flip-flops, and the power dissipated in the other nodes and wires within the logic block.
(a) Power dissipated in the lookup tables. Lookup tables in FPGAs are commonly implemented as multiplexer trees. To estimate the power dissipated in a multiplexer tree, we represent the tree as a set of two-input multiplexers, as shown in Figure 11. We then use the transition density model (as before) to estimate the activity of each node within the lookup table. The capacitance of each node within the lookup table is estimated by noting that each node is associated with three source/drain capacitances and one gate capacitance (the gate capacitance is due to the Miller effect spread over two transistors, as described in Kang and Leblebici [1999]).


Fig. 10. Schematic of a logic block (from Betz et al. [1999]).

Fig. 11. Modeling of a two-input lookup table using two-input multiplexers.

To verify the model, an HSPICE simulation was used. The power predicted by the HSPICE simulation depends heavily on the relative switching times of the inputs. Since our model does not take this into account, we performed several thousand HSPICE simulations, each with different relative input switching times. For each combination of input arrival times, we measured the power predicted by HSPICE. Figure 12(a) shows the maximum power obtained from the HSPICE simulations (over all signal arrival time combinations), the minimum power obtained from the HSPICE simulations, and the average power obtained from the HSPICE simulations, all as a function of the transition density of each input. The measurements obtained from our model are plotted on the same graph. As the graph shows, our estimate lies within the maximum and minimum HSPICE predictions. The graph also illustrates that the model power is closer to the maximum HSPICE prediction than the minimum prediction; this is expected because the transition signal density model assumes all inputs to the multiplexers


Fig. 12. Comparison of model and HSPICE results for lookup tables.

Fig. 13. Modeling of a four-input multiplexer using two-input multiplexers.

are switching at different times, which is the worst-case scenario. Figure 12(b) shows the same results as a function of the lookup table size; again, the same conclusions hold. Note that, in both sets of results, the fidelity (relative difference between power estimates) between the model predictions and HSPICE results is very good. The average difference between the maximum HSPICE prediction and the modeled values is 14.5%. (b) Power dissipated in the input multiplexers. The input multiplexers select the lookup table input signals from among the routing tracks. Since these multiplexers are similar in structure to the lookup tables, the modeling is similar. There are, however, two important differences. First, as illustrated in Figure 11, the gates of the pass transistors inside the LUTs are connected directly to the internal routing; therefore, the internal nodes inside the LUT can be affected by the body effect of the pass transistors, and may swing at a degraded supply voltage. On the other hand, as shown in Figure 13, the gates of the pass transistors inside the input multiplexers are connected to SRAM cells. We assume that the SRAM cells are powered by a higher voltage than the core voltage, meaning that the internal nodes inside the input multiplexers are not affected by the body effect and swing at the full core voltage. A second reason that the input multiplexers have different power behavior than the multiplexers within the lookup tables is that the internal nodes within input multiplexers are often more correlated to each other ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005.


Fig. 14. Comparison of model and HSPICE results for input multiplexers.

than those within the LUTs. Consider the example in Figure 13. When input 0 is selected, transistors A, C, and E are turned on, and node n1 and the output node of the multiplexer always switch at the same time. Such a phenomenon may not happen as frequently in LUTs as in the input multiplexers, because the input signals to the LUTs can switch at different times. To investigate this, we repeated our HSPICE comparisons for the input multiplexers. As shown in Figure 14, the HSPICE results are roughly 20% lower than the model predictions. Based on these empirical results, we scaled the power dissipation in the input multiplexers by 80% to better estimate the actual power dissipation. Figure 14 shows both the original power model predictions as well as the predictions after the scaling.
(c) Power dissipated in flip-flops. To determine the dynamic power dissipated inside each D flip-flop in an FPGA logic block, a detailed transistor-level HSPICE model was simulated at various clock frequencies to investigate the relationship between the input density and the power dissipation. Based on the simulation results, we used the curve-fitting facilities of Matlab to derive the following:

    \text{Dynamic power (DFF)} = 0.5 \cdot C_{DFF} \cdot (\text{Effective density}) \cdot V_{supply} \cdot V_{swing} \cdot f_{clk},    (4)

    \text{Effective density} = (-0.074) \cdot D(\text{input}) + (5.2486) \cdot D(\text{input})^2,    (5)

where D(input) is the transition density of the input signal of the D flip-flop, V_supply is the supply voltage, V_swing is the swing voltage, and f_clk is the clock frequency. The quantity C_DFF is the total capacitance of all nodes inside a flip-flop that toggle when the flip-flop changes state (this was estimated using reasonable transistor sizes and source/drain overlaps for our flip-flop circuit). Figure 15 shows the comparison between our results and the HSPICE estimates; the average difference between the simulation results is 10.5%.
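The curve-fitted flip-flop model of Equations (4) and (5) reduces to a few lines; C_DFF and the input density below are illustrative values.

```python
def dff_dynamic_power(c_dff, d_input, v_supply, v_swing, f_clk):
    """Equations (4)-(5); the polynomial coefficients are the
    Matlab-fitted values quoted above."""
    effective_density = -0.074 * d_input + 5.2486 * d_input ** 2   # Eq. (5)
    return 0.5 * c_dff * effective_density * v_supply * v_swing * f_clk

# Example: 20 fF of toggled capacitance, input density 0.3, 1.8 V, 100 MHz
print(dff_dynamic_power(20e-15, 0.3, 1.8, 1.8, 100e6))
```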


Fig. 15. Dynamic power of D-flip-flop versus input density.

Fig. 16. RC ladder network corresponding to a clock tree with two clock buffers.

(d) Power dissipated in clock tree. Finally, the dynamic power of the clock network is determined by assuming an H-tree clock network, as shown in Figure 4. The clock network consists of a set of clock buffers connected using clock segments, as shown in the diagram. The optimum number of clock buffers and clock segments, as well as the optimum buffer size, depends on the size of the FPGA. Since we want our model to be flexible enough to estimate the power for any size of FPGA, we have developed a method of predicting the number and size of the clock buffers and segments based on the size of the FPGA. Given the number of logic blocks in the FPGA, we can calculate X, the length of the longest path from the clock source to a flip-flop clock pin (we calculate this distance based on the physical dimensions of the logic blocks, which, in turn, depend on the logic block architecture). We then model a single path in the clock tree network as a distributed RC ladder network, as shown in Figure 16. In general, there are M stages (corresponding to M clock buffers), and each clock buffer is of size N. In the example of Figure 16, M = 2. By solving the RC equation corresponding to the ladder network, and by differentiating with respect to N and M, we can estimate the number of clock buffers as

    M = \sqrt{\frac{R_w C_w X^2}{2 R_t (C_d + C_g)}}    (6)

and the relative drive strength of each buffer as

    N = \sqrt{\frac{R_t C_w}{R_w C_g}},    (7)

where R_w and C_w are the wire resistance and capacitance per unit length, R_t represents the resistance of the clock buffer, and C_g and C_d represent the gate and drain capacitance of the clock buffer, respectively. Using these values of M and N, the power dissipated in the clock network can be calculated as before.

3.3 Short-Circuit Power

Short-circuit power is dissipated through a direct current path between the power supply and ground during each transition. Short-circuit power is a function of the rise and fall times and the load capacitance [Eckerbert and Larsson-Edefors 2001; Wang and Vrudhula 1999]. We model the short-circuit power as 10% of the dynamic power calculated in Section 3.2. This percentage was obtained using HSPICE simulations and parameters from FPGA datasheets [Altera 2001; Xilinx 2001].

3.4 Leakage Power

Leakage power dissipation comes from two sources: reverse-bias leakage power and subthreshold leakage power. As the majority of leakage power is from subthreshold current [Leshavarzi et al. 1997], the reverse-bias leakage current is assumed to be negligible. A first-order estimation model is applied to estimate the subthreshold current [Kang and Leblebici 1999]:

    I_{drain}(\text{weak inversion}) = I_{on} \cdot \exp\!\left(\frac{(V_{gs} - V_{on})\,q}{nkT}\right),    (8)

where V_on is the boundary between the weak and strong inversion regions. The following equations are used to calculate V_on:

    V_{on} = V_t + \frac{nkT}{q},    (9)

    n = 1 + \frac{q\,N_{FS}}{C_{ox}} + \frac{C_d}{C_{ox}},    (10)

where I_on is the drain current at the boundary, when V_gs is equal to V_on. The velocity saturation model [Toh et al. 1988] is employed to calculate I_on:

    I_{on} = \frac{W\,v_{sat}\,C_{ox}\,(V_{gs} - V_t)^2}{(V_{gs} - V_t) + E_c L_{eff}},    (11)

where W is the device width, v_sat is the electron saturation velocity, E_c is the critical electric field of the piecewise carrier drift-velocity model, L_eff is the effective source-drain channel length, V_gs is the gate-source voltage, V_t is the threshold voltage, and C_d is the capacitance associated with the depletion region. The constants k and q are Boltzmann's constant and the elementary charge, respectively. T is the temperature in kelvins.


The quantity N_FS is the number of fast surface states. It is a current fitting parameter that determines the slope of the subthreshold current-voltage characteristic [Kang and Leblebici 1999]. Each temperature has a specific N_FS value. To determine the N_FS values of NMOS and PMOS transistors, HSPICE simulations were run for both types of transistors, with different transistor sizes and over the temperature range from −40 to 100°C. To be conservative, the V_gs value is assumed to be half of the threshold voltage, 0.2 V. The average error between the estimated values and the simulated results is 13.4%. The leakage power is calculated by multiplying the subthreshold current by the supply voltage:

    \text{Leakage power} = I_{drain}(\text{weak inversion}) \cdot V_{supply}.    (12)
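Equations (8) through (12) chain together as in the following sketch; every parameter value passed in would come from the technology file or HSPICE fitting, and the names here are illustrative only.

```python
import math

K_B = 1.380649e-23   # Boltzmann's constant, J/K
Q_E = 1.602177e-19   # elementary charge, C

def leakage_power(w, v_sat, c_ox, c_d, n_fs, v_gs, v_t, e_c, l_eff,
                  temp_k, v_supply):
    n = 1.0 + (Q_E * n_fs / c_ox) + (c_d / c_ox)              # Eq. (10)
    v_on = v_t + n * K_B * temp_k / Q_E                       # Eq. (9)
    i_on = (w * v_sat * c_ox * (v_on - v_t) ** 2 /
            ((v_on - v_t) + e_c * l_eff))                     # Eq. (11) at Vgs = Von
    i_sub = i_on * math.exp((v_gs - v_on) * Q_E /
                            (n * K_B * temp_k))               # Eq. (8)
    return i_sub * v_supply                                   # Eq. (12)
```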

All the logic blocks and routing switches, including the unused logic blocks and unused routing switches, are considered in the leakage power calculation. The leakage current of each SRAM cell can be defined by the user in the architecture input file in order to include the SRAM leakage in the power estimation.

4. ARCHITECTURE EXPERIMENTS AND SENSITIVITY ANALYSIS

4.1 Methodology

To investigate the impact of architectural parameters on the power consumption of an FPGA, we conducted experiments using both the baseline FPGA CAD flow [Betz et al. 1999] and a power-aware FPGA CAD flow [Lamoureux and Wilton 2003]. We followed the architecture evaluation flow proposed in Betz et al. [1999], outlined in Figure 17. Each benchmark circuit was optimized by SIS [Sentovich et al. 1992]. In the baseline flow, each circuit was technology-mapped using FlowMap [Cong and Ding 1994]. Each circuit was then packed into logic clusters using T-VPack [Betz 2000] and placed and routed using VPR [Betz et al. 1999]. The activity estimator was applied to the mapped circuit to estimate the transition densities for all the nodes. In the power-aware flow, the activity of each node is used to guide the technology-mapping (Emap), clustering (P-T-Vpack), placement, and routing (PVPR) steps. In all cases, the smallest square FPGA with sufficient logic blocks and pads was assumed. VPR and PVPR were employed to determine the minimum number of tracks (Wmin) required for the circuit. Then we performed a final "low-stress" routing of each circuit with the number of tracks per channel set to 20% more than Wmin. Fixed channel widths for each given benchmark circuit were used to ensure that the architectures used in both CAD flows were the same, in order to produce unbiased experimental results. Detailed power estimation was performed at the end, using the activity, capacitance, and timing information obtained throughout logic synthesis, placement, and routing. The 20 largest Microelectronics Centre of North Carolina (MCNC) benchmark circuits were used for these experiments. SRAM cells were assumed to use high-threshold-voltage devices, in which leakage current is negligible.
Instead of using power for evaluation, we express our results in terms of energy. Energy is the product of the clock period at which the circuit is run


Fig. 17. Architectural evaluation flow [Betz et al. 1999].

Table I. Parameters Under Investigation

Parameter        Description
Segment length   The number of logic blocks spanned by each wire segment
Switch block     Switch block topology
Cluster size     The number of LUTs per logic block
LUT size         The number of inputs per lookup table

and the power dissipated by the circuit. Using energy as the metric avoids favoring architectures and implementations where the power is reduced simply by slowing down the clock. Of course, this does not imply that the most energy-efficient architecture is necessarily the best; an FPGA designer would need to trade off delay for energy, depending on the target market and intended applications.
Our experiments focused on the effects of the four architectural parameters listed in Table I. We varied these parameters one at a time. The routing architecture consisted of 50% pass-transistor-switched wires and 50% tristate-buffer-switched wires. The fraction of wires in each channel connected to a logic block input pin, Fc_input, was 0.6, and the fraction of wires in each channel connected to a logic block output, Fc_output, was 0.25 for the architecture with four-input LUTs and a cluster size of 4. As the cluster and LUT sizes changed, the number of cluster inputs, routing switch sizes, and wire length per unit segment were adjusted


Fig. 18. Disjoint, Universal, and Wilton switch block topologies [Masud and Wilton 1999].

Fig. 19. Imran switch block [Masud and Wilton 1999].

accordingly [Ahmed 2000]. The static probability and the transition density for each primary input were set to 0.5. In all cases, we assumed a 0.18-µm CMOS technology with a Vdd of 1.8 V.

4.2 Segment Length and Switch Block Topology

The FPGA routing fabric consists of many prefabricated segments. The length of these prefabricated segments is one of the key decisions that an FPGA architect must make. In Betz et al. [1999], it was shown that a segment that spans four logic blocks is good for speed and area; in this section, we determine which segment lengths work well for energy. Intuitively, the longer each routing segment, the more energy it will require to switch the segment. On the other hand, longer segments result in fewer switches, which may result in a decrease in energy. Since the optimum choice of segment length is so tightly coupled with the optimum choice of switch block topology, we considered both parameters in the same set of experiments. Four switch block topologies, Disjoint [Lemieux and Brown 1993], Universal [Chang et al. 1996], Wilton [Masud and Wilton 1999], and Imran [Masud and Wilton 1999], were considered. The four topologies are shown in Figures 18 and 19. The Disjoint switch block connects each pin to the pins with the same pin number on the three other sides of the switch block. The Universal switch block focuses on maximizing the number of signals that can be routed through a switch block at the same time. The Wilton switch block is similar to the Disjoint, except that the diagonal connections from the pins


Fig. 20. Routing energy versus segment length.

are rotated by one track. A previous study has shown that the Wilton switch block provides good routing flexibility, but lower area-efficiency when long segments are employed, compared with the Disjoint block [Betz et al. 1999]. The fourth topology, the Imran switch block, provides both good flexibility and area-efficiency by combining aspects of the Disjoint topology and the Wilton topology [Masud and Wilton 1999].
Figure 20 shows the impact of segment length and switch block topology on the routing energy dissipation, averaged over all benchmark circuits. In these experiments, the number of inputs per lookup table was kept at four, and the number of lookup tables per cluster was also fixed at four. The segment length was varied from 1 to 16 (all segments were assumed to have the same length). The results from both CAD flows show that circuits dissipated less energy when routed with shorter wires. This conclusion further confirms the finding from Shang et al. [2002] that FPGA designs should take advantage of the locality of wire connections. In addition, the results show that the Imran and Disjoint switch block topologies are preferable for all segment lengths.

4.3 Cluster Size and Lookup Table Size

In this section, we investigate the impact of the number of inputs per lookup table, and the number of lookup tables per cluster, on the energy dissipated by an FPGA. First, consider the impact of cluster size on energy. Intuitively, an architecture composed of larger clusters can perform more complicated functions in each logic block, meaning fewer clusters are required to implement a given circuit. On the other hand, large clusters imply larger input select multiplexers and longer wires within a cluster. The results in the left side of Figure 21 show that the optimum cluster size depends on both factors. In


Fig. 21. Energy versus cluster size and energy versus LUT size.

these experiments, both the number of inputs per lookup table and the segment length were fixed at four, and a Disjoint switch block topology was assumed. The results show that as the cluster size was increased, the energy dissipated in the logic blocks increased, as expected. The energy dissipated in the routing fabric did not change dramatically; as the clusters got larger, fewer connections between the clusters were required, but these connections were longer, since the clusters were bigger. The graphs show that the clock energy decreased; this is counterintuitive, since the total number of flip-flops remained the same. The reason for this behavior is that the clock branches within the logic block were accounted for as part of the logic block power. Overall, the most energy-efficient cluster size was between 8 and 10 lookup tables.

The right half of Figure 21 shows the energy dissipation within the FPGA as a function of LUT size. In these experiments, we fixed the cluster size and the routing segment length to four, to be consistent with previous work in Betz et al. [1999] (ideally, we would repeat all experiments for all combinations of LUT size and cluster size; however, this is not feasible). Results from both the baseline and the power-aware CAD flows show that the logic block energy increased with the LUT size, while the clock energy decreased with the LUT size. The energy consumed by the routing fabric initially decreased as the LUT size grew, but for large LUTs the energy began to rise. Larger LUTs are capable of more complex functions, meaning fewer logic blocks are required for the same circuit and fewer clock branches are needed. However, larger LUTs have more internal connections and, therefore, increase the size of the logic blocks; this boosts the energy dissipation on both internal routing and block-to-block connections. Overall, the baseline CAD flow gives an optimal LUT size of three, while the power-aware CAD flow gives an optimal LUT size of six. Overall, a LUT size of four seems to be a good choice for energy-efficient architectures


Fig. 22. Energy distribution among the routing fabric, logic blocks, and clock network.

according to both CAD flows. Our experimental results obtained for variable cluster sizes and LUT sizes demonstrate similar trends to those in Li et al. [2003].

It is interesting to compare the impact of cluster size and LUT size on energy. Intuitively, we may expect a change in LUT size to have a more significant effect on overall power than a change in cluster size, since the area due to the LUT increases exponentially as the number of inputs to the LUT increases. The graphs show, however, that both architectural parameters have a similar impact on power. There is a fundamental difference between increasing the LUT size and increasing the cluster size. As the LUT size is increased, fewer LUTs are required, meaning there are fewer flip-flops, less routing, and (importantly) fewer clock connections. These tend to counteract the exponential increase in power that would be intuitively expected. Although the logic block area does go up quadratically, the area of the LUT itself is much smaller than the area of the input multiplexers and the rest of the cluster circuitry, meaning that the exponential effect is not seen. On the other hand, increasing the cluster size essentially rearranges the LUTs within the fabric. The total logic area is not reduced, since the same number of LUTs and clock connections are required, regardless of cluster size.

Comparing the two CAD flows in Figure 21, it can be seen that circuits generated by the power-aware flow dissipate on average 27% less energy in the routing fabric and 12% less energy in the logic blocks than those generated by the baseline flow. Nevertheless, the energy distribution among the routing fabric, logic blocks, and the clock network remains relatively the same for both CAD flows. As shown in Figure 22, between 50% and 60% of the total energy consumption is due to the routing fabric; 20% to 40% is due to logic blocks, and 5% to 40% is from the clock network. These observations match the power dissipation distribution in Shang et al. [2002].


Fig. 23. Energy versus primary input transition density.

4.4 Sensitivity Analysis

This section studies the sensitivity of our results to two major experimental assumptions: the primary input transition density and the routing algorithm used during the experiments. All the experiments were conducted using the same architecture evaluation flow as described in Section 4.1. In each case, an architecture with a cluster size of four, a LUT size of four, and a segment length of four was assumed. The routing architecture consisted of 50% pass-transistor switched wires and 50% tristate-buffer switched wires.

Because the switching characteristics of the primary inputs are not available for the benchmark circuits, researchers often model primary inputs as normalized random signals with a static probability of 0.5 and a transition density of 0.5; that is, the signals are assumed to switch at a rate of 25% of the clock frequency (a signal switching at frequency f makes two transitions per period, so a transition density D corresponds to switching at D/2 of the clock rate). On the other hand, FPGA vendors suggest that typical switching rates of inputs range from 6% to 12% of the clock frequency, corresponding to transition densities from 0.12 to 0.24 [Xilinx 1999, 2002]. This discrepancy is important because the transition density values assumed for the primary inputs can have a significant impact on power evaluation. The energy consumed by the routing and the logic blocks increases with primary input transition density, as shown in Figure 23. Note that the primary input transition density has more effect on the routing energy than on the logic block energy, because the routing wires contribute more capacitance than the logic blocks. However, the primary input transition density assumption does not affect the clock energy, since the dedicated clock network is usually separated from the general-purpose routing.

As the majority of energy is consumed by the routing fabric, the routing algorithms used in the experiments can have a significant impact on energy evaluation. To investigate this, we considered three routing algorithms: breadth-first, timing-driven, and activity-timing-driven. The baseline CAD flow was employed for technology-mapping, clustering, and placement.


Fig. 24. The effect of different routing algorithms on critical path delay and power.

Different routers were used in the routing step. We assumed the same architecture as in the previous set of experiments, and fixed the channel widths to ensure unbiased results. Figure 24 shows the results. As shown in the figure, the average critical path delay of the circuits routed by the breadth-first algorithm was 5.9 times that achieved by the timing-driven algorithm, while the average critical path delay obtained by the activity-timing-driven algorithm was only 2% more than that of its timing-driven counterpart. The breadth-first algorithm was able to route circuits with 78% less power than the timing-driven algorithm. Intuitively, this makes sense. The timing-driven router tends to give preference to timing-critical nets; as a result, the noncritical nets may be longer. The goal of the breadth-first router is to minimize the total capacitance, which, in the absence of activity information, is the best way to reduce power. The timing-driven router is less concerned with minimizing the total capacitance, and more concerned with optimizing the capacitance specifically on the critical nets. The activity-timing-driven algorithm, on the other hand, achieves an average power consumption only 4% lower than that obtained by the timing-driven algorithm. In terms of energy, circuits routed using the breadth-first algorithm consumed, on average, 30% more energy than those routed using the timing-driven algorithm, and circuits routed using the timing-driven algorithm dissipated, on average, 5% more energy than those routed by the activity-timing-driven algorithm. These results are shown in Figure 25. The purpose of these experiments is not to compare the routers, but to investigate the sensitivity of the architectural conclusions to the routing assumptions. From Figure 25, it is clear that if the breadth-first algorithm were used, a different conclusion regarding segment length would be drawn. This is in line with observations in Yan et al. [2002].

5. CONCLUSIONS

In this article, we have presented a detailed power model for island-style FPGAs. The tool is flexible, in that it can be used to estimate power in a wide


Fig. 25. Routing energy versus segment length for different routing algorithms.

variety of FPGA architectures. It is fast, in that estimates can be obtained without the time-consuming computation of programs such as SPICE or reliance on circuit-level simulation. Finally, the model gives good fidelity; although there may be significant absolute errors in the power estimates, the model is capable of evaluating architectural tradeoffs and the efficiency of power-aware CAD tools through relative comparisons among alternative architectures or algorithms.

We have shown how the model can be used to evaluate architectural tradeoffs and estimate the effectiveness of CAD tools. Both the baseline (timing-driven) and power-aware FPGA CAD flows were applied in our evaluation. We found that short segments are more energy-efficient than long segments, that the Disjoint and Imran switch block topologies are more energy-efficient than the Wilton or Universal topologies, and that a cluster size of 8–10 and a lookup table size of four are most energy-efficient. Our study confirms that the routing fabric contributes a major portion of the total energy; FPGA designers should take advantage of short wires. Our investigation also indicates that energy evaluation can be affected by assumptions regarding the primary input transition density and the routing algorithm.

APPENDIX

The model described in this article is freely available for noncommercial use from http://www.ece.ubc.ca/~stevew.

ACKNOWLEDGMENTS

Many thanks to Dr. Vaughn Betz for providing the VPR CAD tool, and to Julien Lamoureux for providing the power-aware version of the VPR CAD tool. The authors are grateful to Dr. Resve Saleh, Dr. F. N. Najm, and Li Shang for their helpful discussions.

REFERENCES

AHMED, E. 2000. The effect of LUT and cluster size on deep-submicron FPGA performance and density. In Proceedings of the ACM International Symposium on Field-Programmable Gate Arrays.


ALTERA. 2001. APEX 20K programmable logic device family data sheet, ver. 4.1. Altera Corp., San Jose, CA. Web site: www.altera.com.
BETZ, V., ROSE, J., AND MARQUARDT, A. 1999. Architecture and CAD for Deep-Submicron FPGAs. Kluwer Academic Publishers, Norwell, MA.
BETZ, V. 2000. VPR and T-VPack user's manual, ver. 4.30.
CONG, J. AND DING, Y. 1994. FlowMap: An optimal technology mapping algorithm for delay optimization in lookup-table based FPGA designs. IEEE Trans. Comput.-Aid. Des. Integrat. Circ. Syst. 13, 1, 1–12.
CHANG, Y. W., WONG, D., AND WONG, C. 1996. Universal switch modules for FPGA design. ACM Trans. Des. Automat. Electron. Syst. 1, 80–101.
ECKERBERT, D. AND LARSSON-EDEFORS, P. 2001. Interconnect-driven short-circuit power modeling. In Proceedings of the Euromicro Symposium on Digital Systems Design. 412–421.
GEORGE, V., ZHANG, H., AND RABAEY, J. 1999. The design of a low energy FPGA. In Proceedings of the International Symposium on Low Power Electronics and Design. 188–193.
HWANG, J. M., CHIANG, F. Y., AND HWANG, T. T. 1998. A re-engineering approach to low power FPGA design using SPFD. In Proceedings of the Design Automation Conference. 722–725.
KANG, S. M. AND LEBLEBICI, Y. 1999. CMOS Digital Integrated Circuits: Analysis and Design. McGraw-Hill, New York, NY.
KESHAVARZI, A., ROY, K., AND HAWKINS, C. 1997. Intrinsic leakage in low power deep submicron CMOS ICs. In Proceedings of the IEEE International Test Conference. 146–155.
KUMAR, R. AND RAVIKUMAR, C. P. 2002. Leakage power estimation for deep submicron circuits in an ASIC design environment. In Proceedings of the 15th International Conference on VLSI Design (Jan.).
LAMOUREUX, J. AND WILTON, S. 2003. On the interaction between power-aware FPGA CAD algorithms. In Proceedings of the IEEE International Conference on Computer-Aided Design.
LEMIEUX, G. G. AND BROWN, S. D. 1993. A detailed router for allocating wire segments in field-programmable gate arrays. In Proceedings of the ACM Physical Design Workshop.
LESEA, A. AND ALEXANDER, M. 2001. Powering Xilinx FPGAs, XAPP158, ver. 1.4. Xilinx Inc., San Jose, CA. Web site: www.xilinx.com.
LI, F., CHEN, D., HE, L., AND CONG, J. 2003. Architecture evaluation for power-efficient FPGAs. In Proceedings of the ACM International Symposium on Field-Programmable Gate Arrays. 175–184.
MASUD, M. I. AND WILTON, S. J. E. 1999. A new switch block for segmented FPGAs. In Proceedings of the International Conference on Field-Programmable Logic and Applications. 274–281.
NAJM, F. N. 1994a. A survey of power estimation techniques in VLSI circuits. IEEE Trans. VLSI Syst. 2, 4 (Dec.), 446–455.
NAJM, F. N. 1994b. Low-pass filter for computing the transition density in digital circuits. IEEE Trans. Comput.-Aid. Des. 13, 9 (Sept.), 1123–1131.
NAJM, F. N. 1995a. Feedback, correlation, and delay concerns in the power estimation of VLSI circuits. In Proceedings of the ACM/IEEE Design Automation Conference. 612–617.
NAJM, F. N. 1995b. Power estimation techniques for integrated circuits. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design.
RABAEY, J. M. 1996. Digital Integrated Circuits: A Design Perspective. Prentice-Hall, Englewood Cliffs, NJ.
SENTOVICH, E. M. ET AL. 1992. SIS: A system for sequential circuit analysis. Tech. rep. UCB/ERL/M92/41. University of California, Berkeley, Berkeley, CA.
SHANG, L., KAVIANI, A. S., AND BATHALA, K. 2002. Dynamic power consumption in Virtex-II FPGA family. In Proceedings of the ACM International Symposium on Field-Programmable Gate Arrays. 157–164.
TOH, K. Y., KO, P. K., AND MEYER, R. G. 1988. An engineering model for short-channel MOS devices. IEEE J. Solid-State Circ. SC-23, 4 (Aug.), 950–958.
TSUI, C. Y., PEDRAM, M., AND DESPAIN, A. M. 1994. Exact and approximate methods for calculating signal and transition probabilities in FSMs. In Proceedings of the Design Automation Conference. 18–23.


TUAN, T., LI, S.-F., AND RABAEY, J. 2001. Reconfigurable platform design for wireless protocol processors. In Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2 (May). 893–896.
WANG, Q. AND VRUDHULA, S. B. K. 1999. A new short circuit power model for complex CMOS gates. In Proceedings of the IEEE Alessandro Volta Memorial Workshop on Low Power Design (Volta99). 98–106.
XILINX. 1999. Understanding XC9500XL CPLD power, ver. 1.1. Xilinx, Inc., San Jose, CA. Web site: www.xilinx.com.
XILINX. 2001. Virtex-E 1.8V field programmable gate arrays data sheet, ver. 2.2. Xilinx, Inc., San Jose, CA. Web site: www.xilinx.com.
XILINX. 2002. Virtex power estimator user guide, XAPP152, ver. 1.1. Xilinx, Inc., San Jose, CA. Web site: www.xilinx.com.
YAN, A., CHENG, R., AND WILTON, S. J. E. 2002. On the sensitivity of FPGA architectural conclusions to experimental assumptions, tools, and techniques. In Proceedings of the ACM International Symposium on Field-Programmable Gate Arrays. 147–156.
YEAP, G. 1998. Practical Low Power Digital VLSI Design. Kluwer Academic Publishers, Norwell, MA.

Received March 2003; revised October 2003, June 2004; accepted August 2004


Optimized Wafer-Probe and Assembled Package Test Design for Analog Circuits

SOUMENDU BHATTACHARYA and ABHIJIT CHATTERJEE
Georgia Institute of Technology

It is well known that wafer-probe test costs of analog ICs are an order of magnitude less than the corresponding test costs of assembled packages. It is therefore natural to push as much of the testing process into wafer-probe testing as possible to reduce the scope of assembled package testing. However, the signal drive and response observation capabilities during wafer-probe testing are limited in comparison to those of assembled packages. In this article, it is shown that by using band-limited transient test signals, which can be supported by wafer-probe test instrumentation, significant numbers of bad ICs can be detected early, during the wafer-probe test. The optimal test stimuli are determined by cooptimizing the wafer-probe and assembled package test waveforms. Overall test costs, including the cost of packaging bad ICs, are minimized and are reduced by up to a factor of four. The proposed method has been validated using hardware test data obtained through measurements made on a prototype.

Categories and Subject Descriptors: B.8.1 [Performance and Reliability]: Reliability, Testing, and Fault-Tolerance

General Terms: Algorithms, Experimentation, Measurement

Additional Key Words and Phrases: Assembled package, wafer-probe, test cost minimization, cooptimization, simulation, prototype, test, test generation and co-optimization, analog and mixed-signal test

1. INTRODUCTION

Due to the increasing complexity and speed of analog, mixed-signal, and RF ICs, test engineers are often faced with characterization and production testing challenges to ensure high product quality without incurring excessive testing costs. In a typical IC manufacturing process, the bare dies on each silicon wafer are first tested using a wafer-probe tester; this procedure is called wafer-probe test.

This research was supported by the Semiconductor Research Corporation under contract number SRC-TJ-680.001.
Authors' addresses: School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332; email: {soumendu,chat}@ece.gatech.edu.
ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005, Pages 303–329.


Fig. 1. Models used for WP and AP test setups.

Fig. 2. Wafer-probe model.

Subsequently, the "good" dies are diced, packaged, and tested again; this procedure, known as assembled package test, eliminates any bad ICs passed by the wafer-probe test. During wafer-probe test, contact is made with each die on the wafer through traveling probes that touch down on the die pads. Typically, for analog circuits, DC tests and power supply current tests are performed during wafer-probe (WP) test. All other specifications, such as slew rate, gain, and noise figure, are measured during assembled package (AP) test. The electrical performance of the probe is modeled using a passive network, referred to as the wafer-probe model in Figure 1. Such a passive network for an industrial wafer-probe tester is shown in Figure 2. The probe model connected to the input of the device under test (DUT) in Figure 1 limits the signal that can be applied to the circuit under test. The same model connected to the output of the DUT limits the external tester's ability to observe the response to the applied stimulus. Similarly, during AP test, the signals that can be seen by the circuit under test are limited by the package parasitics. Package models associated with the inputs and outputs of


Fig. 3. Assembled package parasitic model.

the circuit under test are used to model the test signal application and device response observation constraints during AP test (Figure 1). Such a package model is shown in Figure 3. During test stimulus generation, the suitability of a candidate test stimulus for WP or AP test is evaluated by simulating the circuit under test using the associated end-to-end models of Figure 1. The objective is to be able to evaluate the performance of the embedded circuit under test by observing the output of the corresponding model of Figure 1 (these are the signals seen by an external tester). Furthermore, the goal is to detect most of the bad dies during WP test, even when the signal drive limitations imposed by the WP model are significant. As an extreme example, the circuit under test may be a 1-GHz device, while the wafer probe may be "good" only up to 300–400 MHz. Parasitic models for the HPL-94-18 tester and a DIP40 ceramic package are used in this article for the WP and AP models, respectively; these are shown in Figure 2 and Figure 3.

2. PREVIOUS WORK

In the past, there has been significant effort in the area of reducing specification testing time by eliminating unnecessary tests and ordering them in an optimal way [Brockman and Director 1989]. However, these tests are still time-consuming and expensive. In Milor and Viswanathan [1989], the authors proposed using extra tests during wafer-probe test to eliminate faulty packaged circuits. In Soma and Devarayanadurg [1994], the authors proposed a way to generate tests for analog circuits using fault models. Nagi et al. [1993] presented a test frequency selection algorithm for AC test using behavioral-level fault modeling. In Salamani et al. [1994], a test generation algorithm for detecting single and multiple faults based on circuit sensitivity computation was presented. In Abderrahaman et al. [1996], the authors solved the test frequency selection problem through optimization techniques. Zheng et al. [1996]


used a digital test generator tool to obtain test stimuli for analog circuits. Tsai [1991] showed that the difference between the faulty and fault-free circuits can be maximized using a quadratic programming-based transient test optimization procedure. Balivada et al. [1996] used a ramp as the primary test stimulus for analog circuits.

The process of alternate testing consists of applying a short transient test stimulus to the circuit under test and analyzing the corresponding (sampled) test response to predict the circuit's specifications. In the alternate testing approach, the test specifications of the circuit-under-test are not measured directly using conventional methods (such as dedicated circuitry and stimuli to measure CMRR, for example). Instead, a specially crafted stimulus is applied to the circuit-under-test and the (conventional) test specification values are computed (predicted) from the observed test response. In general, all the test specifications can be computed from the response to a single applied test stimulus. Variyam and Chatterjee [1997, 1998, 2000] presented an assembled package test method that relies on the application of a carefully designed transient test stimulus for implicit specification testing [Gomes and Chatterjee 1999] of analog ICs. In this approach, the specification values of the DUT are predicted from the response of the DUT to the applied test stimulus. This approach is not directly applicable to wafer-probe test, as the signal drive and response observation capabilities during wafer-probe test are limited due to parasitics from the long cables and probes used in the procedure. Moreover, the various test cost components of wafer-probe and assembled package tests need to be considered concurrently to get the maximum benefit from both tests and to construct an optimal wafer-probe and corresponding assembled package test program that minimizes the overall test cost. In Bhattacharya and Chatterjee [2002a], we incorporated bandwidth constraints for wafer-probe test during test generation. Later, in Bhattacharya and Chatterjee [2002b], we used nonlinear regression models [Friedman 1991] to predict the specifications from the responses for both assembled package and wafer-probe tests and cooptimized the tests for wafer-probe and assembled package. In Bhattacharya and Chatterjee [2003], we showed that this method can be applied to high-speed devices as well and that, at the same time, excellent accuracy in specification prediction can be achieved by the proposed test methodology.

3. OBJECTIVES AND MOTIVATION

In this article, we propose to use transient tests for both WP and AP test. Transient tests can be designed to extract the maximum amount of information about the circuit under test even when the WP performance is not as good as that of the circuit under test. In this way, a significant number of bad ICs can be detected during wafer-probe test itself. Further, given all the test cost components of the wafer-probe and assembled package test procedures, new algorithms can be developed for cooptimizing wafer-probe and assembled package tests to minimize the overall test cost and test time, while ensuring high coverage and improving yield. Figure 4 shows the test framework used in industry for pass/fail binning of circuits/devices.


Fig. 4. WP and AP test framework.
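To make the alternate-test flow concrete, the sketch below trains a regression model that maps a sampled transient response to a specification value and then grades a new die from its response alone. It is only an illustration: plain least squares stands in for the MARS models used later in the article, and dut_response() and every number below are invented for the example.

```python
# Minimal sketch of alternate testing: calibrate a response -> specification
# regression on known instances, then predict a new die's specification.
import numpy as np

rng = np.random.default_rng(0)

def dut_response(process_vector, n_samples=64):
    """Toy stand-in for a circuit simulation: a step response whose shape
    depends on the (perturbed) process parameters."""
    t = np.linspace(0.0, 1.0, n_samples)
    gain, tau = process_vector
    return gain * (1.0 - np.exp(-t / tau))

# Calibration set: perturbed DUT instances with known specification values.
P = rng.normal([1.0, 0.2], [0.05, 0.02], size=(50, 2))   # process vectors
R = np.array([dut_response(p) for p in P])               # sampled responses
spec = P[:, 0] / P[:, 1]                                 # toy "specification"

# Fit the response -> specification regression (MARS in the article).
X = np.hstack([R, np.ones((len(R), 1))])                 # add an intercept
w, *_ = np.linalg.lstsq(X, spec, rcond=None)

# "Test" a new die: measure its response, predict its specification.
p_new = rng.normal([1.0, 0.2], [0.05, 0.02])
spec_pred = np.append(dut_response(p_new), 1.0) @ w
print(f"predicted {spec_pred:.3f}, actual {p_new[0] / p_new[1]:.3f}")
```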

The test procedure described in this article follows a test method where a DUT is tested at a frequency much lower than the operating frequency of the circuit. As the test method requires the application of a customized waveform (typically a piecewise-linear waveform), the only capabilities required of the tester are an arbitrary waveform generator (AWG) and a digitizer operating at a much lower frequency than the DUT operating frequency. Since these capabilities come as a standard option on most ATEs, there is no need for an expensive ATE to implement this method.

3.1 Test Approach

In the case of WP test, traveling probes test the whole die and there is very little mechanical movement involved (stepping time is approximately 120 ms) (Electroglas 4090 Fast Prober Data Sheet; go to http://www.electroglas.com). In AP test, every time a new chip needs to be tested, it is put into the test site by a handler, followed by sorting and putting it into the correct bin depending on the outcome of the test (total time required for handling is 300 ms, indexing time being 150 ms) (Aetrium 55v6 Test Handler Data Sheet; go to http://www.aetrium.com). This involves a lot of extra mechanical movement and increases the test cost per second for AP. Thus, the WP test cost per second (approximately 1.5 c/s) is significantly less than the AP test cost per second (approximately 5 c/s). Therefore, bad ICs detected during wafer-probe test can reduce the overall test cost dramatically. During production testing, all bad ICs that pass WP test are packaged for AP test and are eventually discarded after AP test. A good WP test procedure minimizes the number of such bad ICs that are packaged and retested. On the other hand, overly long WP test times (∼5× AP test time) can offset the advantages of WP test.


In our proposed test generation approach, one-time establishment costs for wafer probers or handlers were not considered; a generalized methodology was provided to bring down the production test cost. Several studies concerning the cost of probers and handlers to reduce overall test costs have been performed by various researchers, but in this article the focus is on reducing the overall manufacturing cost by minimizing functional test costs, by reducing test times, and by reducing packaging costs through the alternate test methodology.

In our approach, given the WP test cost/second, the AP test cost/second, the process statistics, the yield, the packaging cost, and the final product cost, we find the optimal WP and AP test stimuli that minimize the overall test cost. For WP test stimulus evaluation, WP parasitic models are used, and for AP test stimulus evaluation, AP parasitic models are used. The final tests are obtained by cooptimizing the WP and AP test stimuli. The WP test cost/time is usually less than the AP test cost/time; however, the proposed algorithm can be configured to accommodate any relation between WP and AP test costs. Using different test cost values for a specific device might result in a different waveform with a different duration, but the proposed method of optimization remains the same. As the test waveform generation and test times depend solely on the test cost, individually changing the test costs for WP and AP will change the duration of the test waveform.

The results of the WP (transient) test and the AP (transient) test are mapped to the test specifications of the packaged ICs for a pass/fail decision. Test guard bands are set so that all good ICs pass WP test; a few bad ICs pass WP test and are detected during AP test. The guard bands for the assembled package test are set in such a way that all bad ICs are detected by the test. Some yield loss may occur with the proposed test procedure due to a few good ICs being classified as bad, but this ensures that the test coverage remains very high and that all the faulty ICs in the lot are detected.

3.2 Fault Model

In our approach, the process parameters (namely, device width, length, oxide thickness, zero-bias threshold voltage, etc.) are varied to generate different instances of the DUT. The process parameters are varied according to a statistical distribution (assumed Gaussian) to mimic the actual manufacturing process, using the mean and the standard deviation values. The parametric variations are modeled as shown in Figure 5. For each such combination of the process parameters, the specification value is computed and, thus, the specification space is populated. This work focuses on functional test generation for the DUT.

Random or spot defects, which are manifestations of catastrophic faults, namely open, short, or bridging faults, cause large deviations from the nominal specification values (Figure 5). Therefore, any spot defect present in the DUT will affect the response to the optimized input waveform considerably, resulting in a grossly inaccurate estimation of the specification(s). In such a case, the DUT will lie outside the test limits imposed by the test engineer. Thus, using the proposed algorithm, any DUT with spot defects will definitely be eliminated. In another case, there might be some spot defects present that do not affect the specifications or the test response at all, or affect them by a very small amount. This essentially means that the specific spot defect is not affecting the DUT performance in


Fig. 5. Parametric fault modeling.

any way. Thus, the proposed algorithm can handle the effect of spot defects, although they are not generated as part of the parametric variation of process parameters.

4. TEST ALGORITHM

The DUT performance parameters are directly related to the process parameters. While all the process parameters affect the DUT performance in some way or other, accurate test generation can be performed by considering a smaller "critical" set of process parameters that bear high correlations to the specification values. Identifying this set of "critical" parameters is necessary to reduce the simulation complexity of the test generation algorithm. So, before the test generation process starts, a search algorithm is used to find the critical process parameters. Section 4.1 describes the algorithm for identifying these process parameters.

4.1 Computing Critical Process Perturbations

Given the process statistics, $N$ process vectors, each consisting of a different assignment of the $n$ process parameters $[p_1, p_2, \ldots, p_n]$, can be generated using statistical sampling. For this purpose, the process parameters are varied around a mean value with a specific standard deviation, and the process and circuit parameter vectors are generated in such a way that they cover the whole process space within the standard deviation limits. These process vectors (which impact the DUT's performance) are applied to the DUT, and the corresponding sets of $m$ specification values of interest, constituting the specification vector $[s_1, s_2, \ldots, s_m]$, are measured. Hence, $N$ specification vectors are generated, one for each of the $N$ process vectors. Once the process parameter and specification vectors are generated, nonlinear modeling using multivariate adaptive regression splines (MARS), explained in Section 4.1.1, is used to relate the process parameters to the specifications.
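As a small illustration of this sampling step, the sketch below draws N process vectors from Gaussian distributions around nominal values and evaluates a specification vector for each instance. The parameter list, the 3% sigma, and measure_specs() (a placeholder for a SPICE-level simulation) are assumptions made here for illustration only.

```python
# Gaussian sampling of process vectors and evaluation of specifications.
import numpy as np

rng = np.random.default_rng(1)

names  = ["tox", "width", "length", "vth0"]          # n = 4 process parameters
means  = np.array([4.1e-9, 0.18e-6, 0.18e-6, 0.45])  # nominal values (illustrative)
sigmas = 0.03 * means                                # assume a 3% standard deviation

N = 500                                              # instances, as in the article
process_vectors = rng.normal(means, sigmas, size=(N, len(means)))

def measure_specs(p):
    # Placeholder for simulating one DUT instance; returns m = 2 specs.
    return np.array([p[1] / p[2], 1.0 / p[0]])

spec_vectors = np.array([measure_specs(p) for p in process_vectors])
print(process_vectors.shape, spec_vectors.shape)     # (500, 4) (500, 2)
```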


Using the models generated, critical process parameters are identified with a greedy search algorithm, as explained in Section 4.1.2. The relation between the responses of the DUT and the specifications is nonlinear. Using a simpler modeling option, for example a Taylor series expansion, for constructing the models may not capture all the nonlinear relationships between the responses and the specifications. For this reason, MARS was used as the modeling option.

4.1.1 MARS Model Generation. Multivariate adaptive regression splines [Friedman 1991] are used for developing the nonlinear model that relates the process parameters to the DUT's test specifications. The MARS algorithm mainly depends on the selection of a set of basis functions and a set of coefficient values corresponding to each basis function to construct the nonlinear model. The model can also be visualized as a weighted sum of basis functions from the set of basis functions that span all values of each of the independent variables. MARS uses two-sided truncated functions of the form $(t - x)_+$ and $(x - t)_+$ as basis functions for linear and nonlinear relationships between the dependent and independent variables, $t$ being the knot position. The basis function has the form

$$(x - t)_+ = \begin{cases} x - t, & x > t \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

The basis functions together with the model parameters are combined to generate the model, which can predict the dependent variables from the independent variable values. The MARS model for a dependent variable $y$, independent variables $x$, and $M$ basis functions can be summarized as

$$y = f(x) = \beta_0 + \sum_{m=1}^{M} \beta_m H_{km}\left(x_{v(k,m)}\right), \qquad (2)$$

where the summation is over the $M$ basis functions, and $\beta_0$ and $\beta_m$ are parameters of the model (along with the knots $t$ for each basis function, which are also estimated from the data). The function $H$ is defined as

$$H_{km}\left(x_{v(k,m)}\right) = \prod_{k=1}^{M} h_{km}, \qquad (3)$$

where $x_{v(k,m)}$ is the $k$th independent variable of the $m$th product. During the forward stepwise placement, basis functions are constantly added to the model. After this step, a backward procedure is applied, in which the basis functions associated with the smallest increase in the least-squares fit are removed, producing the final model. At the same time, the generalized cross-validation error (GCVE), a measure of goodness of fit, is computed to take into account the residual error and the model complexity. The above equation can be further decomposed into sums of linear, square, and cubic products, and so forth. The accuracy can be changed by introducing a larger or smaller number of basis functions, and the degree of nonlinearity of the model can be changed by changing the order of the products.
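The truncated basis functions of Equation (1) and the additive model of Equation (2) can be written directly in a few lines. A real MARS fit chooses the knots and coefficients by forward placement and backward pruning, as described above; the knots and betas below are picked by hand purely for illustration.

```python
# MARS-style hinge basis functions and an additive model built from them.
import numpy as np

def hinge(x, t, sign=+1.0):
    """Two-sided truncated basis: (x - t)+ for sign=+1, (t - x)+ for sign=-1."""
    return np.maximum(sign * (x - t), 0.0)

def mars_predict(x, beta0, terms):
    """Additive MARS-style model: y = beta0 + sum_m beta_m * basis_m(x)."""
    y = np.full_like(x, beta0, dtype=float)
    for beta, knot, sign in terms:
        y += beta * hinge(x, knot, sign)
    return y

x = np.linspace(0.0, 1.0, 11)
y = mars_predict(x, beta0=0.5, terms=[(2.0, 0.3, +1.0), (1.5, 0.7, -1.0)])
print(np.round(y, 3))
```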


Fig. 6. Selecting critical process parameters.

4.1.2 Selection of Critical Process Parameters. As described above, a nonlinear mapping $\mathcal{M}$ from the $n$-dimensional process vector space $\mathbf{p}$ to the $m$-dimensional specification vector space $\mathbf{s}$ is constructed using MARS, as shown in Equation (4):

$$\mathbf{s} = \mathcal{M}(\mathbf{p}). \qquad (4)$$

A set of critical process parameters is identified using a greedy search method. First, one process vector and the corresponding specification vector are eliminated from the observation set. The reduced process parameter set $\mathbf{p}'$ is then related to the reduced specification space $\mathbf{s}'$ using a MARS model, as described in Equation (5):

$$\mathbf{s}' = \mathcal{M}'(\mathbf{p}'). \qquad (5)$$

N such models are created, each time eliminating one process vector and the corresponding specification vector. Each model is then used to predict the eliminated specification vector, with the corresponding eliminated process vector as the input to the model. Next, the error between the estimated and the actual specification vector is computed. The process vector whose exclusion produces the largest error is considered the most critical process parameter vector. The eliminated process vectors for which the corresponding model produces error values above a certain threshold are selected as critical vectors. The algorithm is shown in Figure 6 [Press et al. 1998].
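A minimal sketch of this leave-one-out loop follows: each process vector is dropped in turn, the model is refit on the rest, and the dropped specification vector is predicted; vectors whose exclusion produces the largest prediction error are flagged as critical. Ordinary least squares stands in for the MARS models, and the synthetic data and threshold are illustrative assumptions.

```python
# Leave-one-out ranking of process vectors by prediction error.
import numpy as np

def fit(P, S):
    X = np.hstack([P, np.ones((len(P), 1))])
    W, *_ = np.linalg.lstsq(X, S, rcond=None)
    return W

def predict(W, p):
    return np.append(p, 1.0) @ W

def loo_errors(P, S):
    errors = np.empty(len(P))
    for i in range(len(P)):
        keep = np.arange(len(P)) != i                # leave observation i out
        W = fit(P[keep], S[keep])
        errors[i] = np.linalg.norm(predict(W, P[i]) - S[i])
    return errors

rng = np.random.default_rng(2)
P = rng.normal(size=(40, 3))                         # 40 process vectors
S = P @ rng.normal(size=(3, 2)) + 0.05 * rng.normal(size=(40, 2))
err = loo_errors(P, S)
critical = np.where(err > err.mean() + err.std())[0] # threshold is illustrative
print("critical process vectors:", critical)
```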


4.2 Core Algorithm

During the manufacturing process, as devices are fabricated, there is no way to access the process data unless measurements are made on the device after manufacturing; the process and circuit parameters of a fabricated device are unknown. Therefore, we cannot use a methodology that derives the specifications directly from the process and circuit parameters. The process parameters were varied to generate different instances of the DUT, to mimic the actual manufacturing process. In an actual manufacturing process, all the process parameters vary around their nominal values with a specific statistical distribution; in simulation, it is nearly impossible to vary all the process parameters, so a set of critical process parameters was chosen. For example, the process parameters chosen included the oxide thickness and the width and length of the MOS devices.

Initially, k critical process perturbations are computed for all the specifications, as described in Section 4.1. The algorithm starts with a set of N waveforms and a set of k critical process vectors. The initial choice of waveforms is random; several candidate waveforms, namely sine waves, pulses, and piecewise-linear waveforms, or a combination of those, are used as a preliminary guess for both WP and AP tests. These are successively cooptimized to generate the final test waveform (MATLAB Optimization Toolbox User's Guide; go to www.mathworks.com). First, the test waveforms for WP test and AP test are constructed independently, depending on the frequency and current limitations of each test setup. N such pairs of waveforms, for WP and AP, are constructed and treated as the initial population for cooptimization. All the test waveforms are applied to the DUT while the circuit is perturbed with the critical process perturbation vectors (obtained using the algorithm in Section 4.1). Each of the N pairs of waveforms is simulated for all the k critical process variations to obtain the corresponding k response waveforms. The specifications of the DUT under each of the k process variations are already computed while calculating the critical process vectors. From these data, a MARS model relating the sampled response waveform to the DUT's specifications is constructed (given the response waveform, the model computes the DUT's specification values). This is done separately for the WP and AP test waveforms. Let us denote the model relating the WP responses to the specifications by MWP, and the model relating the combined WP and AP responses to the specifications by Mcomb. For each WP and AP test waveform set, two such models are created. The process is described in Figure 7.

Now, the fitness of each of the models (MWP and Mcomb) is found from a set of reference process vectors and their corresponding specification values, which are also computed beforehand. Using these process vectors to perturb the DUT instances, the test waveforms are applied to the DUT and the response obtained is used as the input to the model. From the response and the model, estimates of the specifications are obtained. These are then compared to the actual values, and errors are computed for each model. The error between the specifications predicted from the model and the actual specifications obtained


Fig. 7. Test generation algorithm.

from the circuit measurements serves as the fitness of the model. The error can be obtained as shown in Equation (6):

$$\text{Spec}_{\text{err}}^{N}(k) = \frac{1}{S} \sum_{j=1}^{S} \left| \text{Spec}_{\text{actual}}^{N}(j) - \text{Spec}_{\text{predicted}}^{N}(k, j) \right| \quad \forall k \in \{1, \ldots, N\}, \qquad (6)$$

where $k$ indicates the stimulus currently being considered, $\text{Spec}_{\text{actual}}^{N}(j)$ is the normalized specification computed for the reference process vectors, and $\text{Spec}_{\text{predicted}}^{N}(k, j)$ is the normalized specification predicted by the model built from the circuit output response measurements with all the process variations applied for the same stimulus. Here $j$ is the index of the normalized specification value, which is computed as shown in Equation (7):

$$\text{Spec}^{N}(j) = \frac{1}{P} \sum_{i=1}^{P} \text{Spec}(i, j). \qquad (7)$$

With these error values, the cost is computed, as described in Section 4.3.
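Equations (6) and (7) translate directly into code: each specification is normalized by its mean over the P reference instances (Equation (7)), and stimulus k is scored by the mean absolute gap between the actual and predicted normalized specifications (Equation (6), here also averaged over the reference instances). The array shapes and synthetic data are assumptions made for illustration.

```python
# Fitness of each candidate stimulus per Equations (6) and (7).
import numpy as np

def stimulus_fitness(spec_actual, spec_predicted):
    """
    spec_actual:    (P, S) actual specs for P reference instances, S specs.
    spec_predicted: (N, P, S) predictions for each of the N stimuli.
    Returns one error value per stimulus; lower is better.
    """
    spec_norm = spec_actual.mean(axis=0)         # Equation (7), one per spec j
    a = spec_actual / spec_norm                  # normalized actual specs
    p = spec_predicted / spec_norm               # normalized predicted specs
    return np.abs(a[None, :, :] - p).mean(axis=(1, 2))

rng = np.random.default_rng(3)
actual = rng.uniform(1.0, 2.0, size=(30, 5))
predicted = actual[None] + 0.02 * rng.normal(size=(4, 30, 5))
print(stimulus_fitness(actual, predicted))
```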

After the costs for all the waveforms are computed, the waveform that gives the least cost is chosen and is used to evolve the next set of waveforms using a genetic algorithm. A genetic algorithm comes in handy for this optimization [Goldberg 1989; Chaiyaratana and Zalzala 1997]. It generates the optimized waveform for specification prediction and searches the solution space of a function using simulated evolution, that is, a survival-of-the-fittest strategy. In general, the fittest individuals of any population tend to reproduce and survive to the next generation, thereby improving successive generations through mutation, crossover, and selection operations applied to individuals in the population. An outline of a generalized genetic algorithm is shown in Figure 8.
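The following is a generic skeleton of such a loop: evaluate a population of candidate waveforms, keep the fittest half, and refill the population by one-point crossover plus Gaussian mutation. The cost function is a toy placeholder for the test-cost evaluation of Section 4.3, and all parameters are illustrative.

```python
# Generic genetic-algorithm skeleton for evolving candidate waveforms.
import numpy as np

rng = np.random.default_rng(4)
POP, LEN, GENS = 20, 32, 50           # population, samples per waveform, generations
target = np.sin(np.linspace(0.0, 3.0, LEN))

def cost(waveform):
    return float(np.sum((waveform - target) ** 2))   # placeholder cost

population = rng.uniform(-1.0, 1.0, size=(POP, LEN)) # random initial waveforms
for _ in range(GENS):
    order = np.argsort([cost(w) for w in population])
    parents = population[order[: POP // 2]]          # selection: keep fittest half
    cuts = rng.integers(1, LEN, size=POP // 2)       # one-point crossover
    children = np.array([np.concatenate([parents[i % len(parents)][:c],
                                         parents[(i + 1) % len(parents)][c:]])
                         for i, c in enumerate(cuts)])
    children += 0.05 * rng.normal(size=children.shape)   # Gaussian mutation
    population = np.vstack([parents, children])

best = min(population, key=cost)
print("best cost:", round(cost(best), 4))
```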


Fig. 8. Genetic algorithm framework.

Fig. 9. Waveform modification procedure.

To keep track of the test generation procedure and the fitness of the test, a few past cost values are stored, and more time points are added to the waveform according to the changes in the cost as test generation progresses. The algorithm for modifying the test waveforms is shown in the pseudocode in Figure 9.

4.3 Optimization Cost Function

In the proposed algorithm, the test engineer provides the specification limits before test generation is started. During testing, these specification limits are used to make a pass/fail decision about the circuit-under-test. Before describing the cost function and its various parts, the different features that constitute the cost function, and which must be taken into account, are described. The test cost has the following components:

(1) wafer-probe test time (WPTT) × wafer-probe test cost/time (WPTC);
(2) assembled package test time (APTT) × assembled package test cost/time (APTC);
(3) the cost incurred due to the packaging of bad ICs that pass wafer-probe test but are detected by assembled package test and eventually rejected;
(4) the cost incurred due to good ICs being classified as bad after the final tests.

Let us define the errors obtained from the models MWP and Mcomb as WPError and Error, respectively. These errors are used to compute the areas


Fig. 10. Specification distribution used to compute the test cost.

designated α, β, and γ in Figure 10, each of which is described in the next few paragraphs. N denotes the total number of ICs tested. Figure 10 shows the (Gaussian) distribution of a specification for a typical manufacturing process. The total area under the curve represents the total number of ICs considered for test purposes. During cost computation, the area of interest is normalized with respect to the total area and multiplied by the total number of ICs tested, as shown in Equation (8), to obtain the number of ICs that fall within that category:

$$\text{NumICs} = \text{TotalICs} \times \frac{\text{Area}_{\text{Type}}}{\text{Area}_{\text{Total}}}. \qquad (8)$$

The first part of the test cost is the cost incurred due to the test time involved in wafer-probe testing: the total wafer-probe test time multiplied by the wafer-probe test cost per unit time gives the wafer-probe test cost. All ICs with specifications outside the specification limits are considered bad. However, a certain uncertainty is introduced by the errors in the models described in Section 4.2. The errors, Error and WPError, indicate the relative error present in the prediction process. Therefore, any IC with a predicted specification value exactly at the specification limit can have its actual specification anywhere within ±WPError around the specification limit; there is an equal chance of this IC being good or bad, so it cannot be classified definitively. However, if the predicted value itself is at (specification limit + WPError), then the IC is definitely bad. Therefore, all ICs with predicted values at and above (specification limit + WPError) are considered bad. The area α in Figure 10 represents the ICs eliminated at wafer-probe test, and the number of such bad ICs is computed using Equation (8) (MATLAB Statistics Toolbox User's Guide; go to www.mathworks.com). The ICs that have their specification values between the specification limit and (specification limit + WPError) are actually bad ICs, but they are packaged and passed on for assembled package testing. Thus, the cost of packaging these ICs is an unwanted cost brought in by the test inaccuracy, namely the prediction error of MWP. This cost is included in the overall test


Fig. 11. Test generation algorithm flowchart.

cost. The area β in Figure 10 allows us to calculate the number of bad ICs that are packaged. Due to inherent errors in test procedures, test engineers generally set the specification limits tighter than the ones set by the designers. This ensures that, despite the inaccuracies in measurements during testing, all bad ICs are definitely eliminated. However, at the same time, some good ICs are eliminated because of the tighter limits, which adds to the overall test cost. In this case, the test limit is determined from the overall Error: the test limit is set to (specification limit − Error). Using the area γ in Figure 10, the number of good ICs that are rejected can be found.
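Under the Gaussian assumption of Figure 10, the three areas follow directly from the normal CDF: α is the fraction rejected at WP test, β the bad ICs that are packaged anyway, and γ the good ICs lost to the tightened AP limit. The sketch below computes them with scipy; the distribution, limit, and error values are illustrative assumptions, not the article's data.

```python
# Guard-banded pass/fail areas of Figure 10 from the normal CDF.
from scipy.stats import norm

mu, sigma = 0.0, 1.0                 # specification distribution
limit = 1.5                          # upper specification limit
wp_error, error = 0.2, 0.1           # WPError and (overall) Error

alpha = norm.sf(limit + wp_error, mu, sigma)   # rejected at WP test
beta = norm.sf(limit, mu, sigma) - alpha       # bad ICs packaged anyway
gamma = norm.cdf(limit, mu, sigma) - norm.cdf(limit - error, mu, sigma)

N = 500                              # total ICs, per Equation (8)
print(f"WP rejects: {alpha * N:.1f}, bad ICs packaged: {beta * N:.1f}, "
      f"good ICs rejected: {gamma * N:.1f}")
```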


Table I. Different Components Used for Cost Computation

Case Study       Cost Component       Value Used
Case Study I     Final product cost   $0.50
                 Packaging cost       $0.20
                 WP test time         150 ms
                 AP test time         500 ms
Case Study II    Final product cost   $0.50
                 Packaging cost       $0.35
                 WP test time         350 ms
                 AP test time         750 ms
Case Study III   Final product cost   $1.50
                 Packaging cost       $0.50
                 WP test time         1.8 s
                 AP test time         4 s

From the above discussion, the total test cost is obtained using the following cost function:

$$\text{Cost} = \text{WPTT} \times \text{WPTC} \times N + (\text{packaging cost per IC}) \times \beta N + \text{APTT} \times \text{APTC} \times (1 - \alpha) \times N + (\text{cost of an individual IC}) \times \gamma N. \qquad (9)$$

5. RESULTS

In the following, results obtained from the developed prototype are presented, using the cost function and the optimization approach as per the algorithm shown in Figure 11. The designs made for this article were developed with references from Allen and Holberg [2002], Gray et al. [2001], and Openbook [1999]. Also, Neter et al. [1996] was used to develop the required statistical background for this work. Results obtained from simulation for three circuits are presented below. The first example (Case Study I) is an experiment demonstrating that a significant number of bad ICs can be detected even when the cutoff frequency of the wafer probe is less than the bandwidth of the DUT. The second example (Case Study II) shows results for a high-frequency amplifier. The final example is an RF mixer (Case Study III). The DUT cost, the packaging cost for each IC, and the test time for each of the ICs using standard specification test are shown in Table I. The test cost was 1 c/s for WP test and 3 c/s for AP test; using these values, the test cost for standard specification test was computed. Altogether, 500 ICs of each type were considered for test generation and validation purposes; a large number of devices is used to ensure proper statistical sampling. The test cost data were obtained from industry members of the Semiconductor Research Corporation (SRC) (go to www.src.org). Case Study IV involved validation of the proposed WP and AP test methodology using hardware measurements made on the National Semiconductor LM318 operational amplifier (National Semiconductor LM318 Data Sheet; go to http://www.national.com).

Table II shows the different operating frequencies of the devices, and the frequency at which they are tested using the proposed methodology. The wafer-probe cutoff frequency is the frequency up to which the signals applied to the prober are not distorted; the distortion is caused by the parasitics introduced by the long




Table II. Operating Frequency and WP Frequency Limits for Different Devices

Device                  Operating Frequency   Test Frequency
OpAmp                   500 kHz               200 kHz
Comparator              250 MHz               50 MHz
Mixer                   900 MHz               100 MHz
Hardware measurement    10 kHz                8.5 kHz

Table III. Accuracies Achieved for Different Specifications (Case Study I)

Specification     Nominal Value   Minimum Error in Prediction (AP Test)
Gain bandwidth    55 MHz          1.09%
Low-freq. gain    76 dB           0.94%
CMRR              −93 dB          2.42%
DC offset         2.5 mV          4.33%
Overshoot         3.4 mV          2.89%

cables and sharp contact probes used during WP test. These introduce current and frequency limitations on the signals and distort high-frequency signals considerably before they reach the device. The input corner frequency is the maximum frequency at which the device can operate without any introduction of nonlinearity into the signal.

5.1 Case Study I

An operational amplifier was designed and used as the circuit-under-test. The bandwidth of the amplifier was close to 500 kHz. The wafer-probe test waveforms were limited to 200 kHz due to the probe parasitics; this limit was imposed by a Tow-Thomas filter connected in front of the DUT. The frequency limitation imposed in the case of the wafer-probe test could be changed by changing the cutoff frequency of the filter.

While determining the number of bad ICs, the total number of bad ICs found over all the specifications was added up and divided by the number of specifications. In this example, the individual bad ICs were not tracked. This was done because the correlation between the specifications was not computed; hence, the assumption was made that the bad ICs were equally distributed over all the specifications.

Table III shows the nominal specification values for the amplifier and the best accuracy achieved for each of the specifications. Table IV shows the test cost/IC and the total ICs rejected at wafer-probe test for processes with different yields. In addition, a study was performed to see how the error in prediction changes as the frequency limitation imposed by the prober at WP test is changed; Table V shows that the error in prediction increases as the frequency limit is lowered. Figure 12 shows a convergence curve for test generation and the changes in cost as the test is optimized.


Table IV. Comparison of Test Cost and ICs Eliminated at WP Test for Different Yield Values

Yield                              99.0%    91.60%   88.40%
Good ICs                           495      458      442
Bad ICs                            5        42       58
Bad ICs detected at WP test        3        20       29
Bad ICs eliminated after AP test   2        22       29
Savings in packaging cost          $0.60    $4.40    $5.80
Test cost/IC                       $0.017   $0.0267  $0.0311

Table V. Change in Error in Prediction with Change in Frequency Limit of WP

Frequency Limit   Error in Prediction (%)
200 kHz           1.5
170 kHz           2.7
130 kHz           6.7
100 kHz           9.5

Fig. 12. Cost convergence curve.

Figure 13 compares the number of bad ICs eliminated at the wafer-probe level with the total number of bad ICs, using the test generated by the algorithm described above. Figures 14 and 15 show the optimized test waveforms for assembled package test and wafer-probe test, respectively.

5.2 Case Study II

The circuit-under-test was a high-frequency, low-offset operational amplifier, the frequency response of which is shown in Figure 16. Table VI shows the specifications of the device. The bandwidth limitations for WP and AP tests were imposed by the parasitic models shown in Figures 2 and 3, respectively. The bandwidths of the parasitic models are ∼50 MHz and 600 MHz for the WP and AP models, respectively. As shown in Table VI, specifications that far exceed the cutoff frequencies of the models could be predicted very accurately.


Fig. 13. Comparison of bad ICs eliminated at WP versus the total bad ICs.

Fig. 14. Assembled package test waveform for Case Study I.

Fig. 15. Wafer-probe test waveform for Case Study I.


Fig. 16. AC response of the circuit-under-test.

Table VI. Accuracies Achieved for Specifications

Specification     Nominal Value   Minimum Error in Prediction (AP Test)
Gain bandwidth    1.1 GHz         0.73%
Phase margin      37°             1.28%
Low-freq. gain    53 dB           0.18%
CMRR              79 dB           0.52%
DC offset         235 µV          3.76%

Table VII. Comparison of Test Cost and ICs Eliminated at WP Test for Different Yields

Yield                              98.60%   88.40%
Good ICs                           493      431
Bad ICs                            7        69
Bad ICs detected at WP test        6        53
Bad ICs eliminated after AP test   1        16
Savings in packaging cost          $2.10    $18.55
Test cost/IC                       $0.018   $0.0271

Table VI shows the nominal specification values for the amplifier and the best accuracy achieved for each of the specifications. The specifications of the circuits were individually tracked, and ICs for which any one of the specifications was out of limits were rejected as bad. Tables VI and VII show the minimum error values obtained, the test cost/IC, and the total ICs rejected at wafer-probe test for processes with different yields. Table VII shows that all the bad ICs were detected after the AP test; hence, the coverage of the test was 100%. In addition, a significant number of bad ICs were detected at the WP test level.


Fig. 17. Assembled package test waveform for Case Study II.

Fig. 18. Wafer-probe test waveform for Case Study II.

Figures 17 and 18 show the optimized test waveforms for the AP and WP tests.

5.3 Case Study III

This example was worked out using a down-conversion mixer with an operating frequency of 900 MHz, and is intended to show that this methodology can also be applied to RF circuits. While all the specifications were tracked accurately, a large number of bad ICs was eliminated after the WP test. Tables VIII and IX show the minimum prediction error values obtained for the different specifications, the test cost/IC, and the total ICs rejected at wafer-probe for processes with different yields. In this case also, the specifications of the individual ICs were tracked.


Table VIII. Accuracies Achieved for Specifications

Specifications              Nominal Value   Minimum Error in Prediction (AP Test)
Total Harmonic Distortion   −64.87 dB       1.11%
Conversion Gain             −7.56 dB        0.52%
PSRR                        −67.39 dB       0.94%
Noise Figure                4.53 dB         0.63%
IIP3                        3.62 dBm        5.50%

Table IX. Test Cost and ICs Eliminated at WP Test

Yield                               89.4%
Good and bad ICs                    447 / 53
Bad ICs detected at WP test         42
Bad ICs eliminated after AP test    11
Savings in packaging cost           $21
Test cost/IC                        $0.0419

Fig. 19. Assembled package test waveform for Case Study III.

Figures 19 and 20 show the optimized test waveforms for the assembled package and wafer-probe tests. In contrast to the previous two case studies, the assembled package test waveform was longer than the wafer-probe test waveform: the device considered here is an RF mixer, so under the signal limitations of wafer-probe testing the wafer-probe test could not capture all the characteristics of the device.

5.4 Case Study IV (Hardware Test)

Case Study IV presents hardware test measurement data for a National Semiconductor LM318 operational amplifier. Test generation was performed for the LM318 based on circuit netlist and model parameter data obtained from the LM318 product data sheet (go to http://www.national.com), using the Cadence Spectre simulation environment (Openbook Cadence Spectre Simulation Handbook; go to http://www.cadence.com).


Fig. 20. Wafer-probe test waveform for Case Study III.

Fig. 21. Assembled package test waveforms from hardware measurement.

Altogether, 100 ICs were used in this experiment: 50 for test calibration and 50 for test validation. The test stimulus was applied to the circuit and the response was captured using the PCI 6110E Data Acquisition Card from National Instruments (NI PCI-6110 User Manual; go to http://www.ni.com/dataacquisition).


Fig. 22. Wafer-probe test measurements made on the prototype.

Fig. 23. Hardware prototype used for validation purposes.

The test calibration involves the following steps:

(1) Apply test waveforms to the circuit-under-test and measure the responses using the right printed wiring board (PWB) shown in Figure 23.
(2) Compute the specifications using the left PWB shown in Figure 23.
(3) Build a regression model M relating the specifications to the responses of the circuit.

For validation, the obtained response was digitized using the same data acquisition card, and the test specifications of the LM318 were predicted from the data so obtained using the regression model M, as discussed in Step 3 above.

Table X. Accuracies Achieved for Specifications

Specifications   Nominal Value   Minimum Error in Prediction (AP Test)
Offset voltage   5 mV            3.68%
Open loop gain   84.7 dB         2.64%
CMRR             −65.56 dB       3.1%

Table XI. Test Cost and ICs Eliminated at WP Test

Yield                               100%
Good and bad ICs                    100 / 0
Bad ICs detected at WP test         0
Bad ICs eliminated after AP test    0
Savings in packaging cost           $0.0
Test cost/IC                        $0.0019

Table XII. Cost Comparisons for Proposed Algorithm and Standard Specification Test (Case Study I)

Yield   WPTT (µs)   APTT (µs)   Test cost/IC^a   Test cost/IC^b   Savings^c in test cost
99%     2080        1370        $0.017           $0.035           2.04
91.6%   2620        1080        $0.027           $0.049           1.853
88.4%   3710        2110        $0.031           $0.056           1.797

a. Test cost using the proposed approach.
b. Test cost using the standard specification test method.
c. Savings are unitless; they represent the ratio of test cost/IC using the standard specification test to test cost/IC using the proposed approach.

Table XIII. Cost Comparisons for Proposed Algorithm and Standard Specification Test (Case Study II)

Yield   WPTT (µs)   APTT (µs)   Test cost/IC^a   Test cost/IC^b   Savings^c in test cost
98.6%   3600        2200        $0.018           $0.047           2.640
86.2%   4200        3000        $0.027           $0.090           3.340

a. Test cost using the proposed approach.
b. Test cost using the standard specification test method.
c. Savings are unitless; they represent the ratio of test cost/IC using the standard specification test to test cost/IC using the proposed approach.

Figures 21 and 22 show the input as well as the output waveforms for both the WP and AP tests. The band-limited signals for WP can be seen in Figure 22. The test waveforms show accurate tracking of the specifications. Figure 23 shows the prototypes that were built for specification measurement and response acquisition. Tables X and XI show the specification values, the error in prediction of the specifications, and the test cost.
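To make Step 3 of the calibration flow concrete, the sketch below fits the simplest possible regression model M, a one-variable least-squares line relating a response feature to a specification, and then predicts the specification of a validation IC as in the validation step. The article does not prescribe this particular model; the arrays resp and spec, the helper fit_line, and all the numbers are hypothetical.

#include <stdio.h>

/* One-variable least-squares fit, spec ≈ a*resp + b: a minimal
 * stand-in for the regression model M of the calibration flow.    */
static void fit_line(const double *resp, const double *spec, int n,
                     double *a, double *b)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx  += resp[i];            sy  += spec[i];
        sxx += resp[i] * resp[i];  sxy += resp[i] * spec[i];
    }
    *a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    *b = (sy - *a * sx) / n;
}

int main(void)
{
    /* Calibration data: a feature of the measured response and the
     * specification measured on the PWB (5 ICs here; the article
     * used 50 ICs for calibration).                                */
    double resp[] = { 0.98, 1.01, 1.05, 0.97, 1.03 };
    double spec[] = { 84.1, 84.9, 85.6, 84.0, 85.2 };
    double a, b;
    fit_line(resp, spec, 5, &a, &b);
    /* Validation: predict the specification of a new IC from its
     * digitized response feature.                                  */
    printf("predicted specification = %.2f\n", a * 1.00 + b);
    return 0;
}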


Table XIV. Cost Comparisons for Proposed Algorithm and Standard Specification Test (Case Study III)

Yield   WPTT (ns)   APTT (ns)   Test cost/IC^a   Test cost/IC^b   Savings^c in test cost
99.8%   2830        5120        $0.042           $0.147           3.513

a. Test cost using the proposed approach.
b. Test cost using the standard specification test method.
c. Savings are unitless; they represent the ratio of test cost/IC using the standard specification test to test cost/IC using the proposed approach.

Table XV. Cost Comparisons for Proposed Algorithm and Standard Specification Test (Case Study IV)

Yield   WPTT (ms)   APTT (ms)   Test cost/IC^a   Test cost/IC^b   Savings^c in test cost
100%    22          31          $0.019           $0.07            3.68

a. Test cost using the proposed approach.
b. Test cost using the standard specification test method.
c. Savings are unitless; they represent the ratio of test cost/IC using the standard specification test to test cost/IC using the proposed approach.

5.5 Cost Comparison with Standard Specification Test

The algorithms were developed in MATLAB, and the Cadence tools Ocean and Spectre were used for circuit simulation. The code was executed on a SUN workstation. In a process line, there is no need to recalibrate the test waveform as long as the process is stable (with respect to time, temperature, humidity, and other environmental factors). Tables XII, XIII, XIV, and XV show the test times, test costs, and a comparison of test costs for the circuits. In some cases, the test cost for the proposed method was almost four times lower than that for the standard specification test method.

6. CONCLUSIONS

The test generation method presented in this article shows promising results for analog and mixed-signal circuits, and even for RF circuits. For example, a mixer operating at RF frequencies was tested at IF frequencies, and a large number of bad ICs could still be detected at the WP level. This methodology is capable of detecting up to 60% of the bad ICs at the WP test level. The rest of the bad ICs are detected at the AP


test level, ensuring high fault coverage. Test times are considerably lower than those of a conventional specification test; up to 70% of the test time is saved using this technique. Finally, using this method, test costs can be significantly lowered compared to standard specification test costs (up to 3.5×).


Received March 2003; revised August 2003, June 2004, September 2004; accepted October 2004


Energy-Efficient Datapath Scheduling Using Multiple Voltages and Dynamic Clocking SARAJU P. MOHANTY University of North Texas and N. RANGANATHAN University of South Florida

Recently, dynamic frequency scaling has been explored at the CPU and system levels for power optimization, and low-power datapath scheduling using multiple supply voltages has been well researched. In this work, we develop new datapath scheduling algorithms that use multiple supply voltages and dynamic frequency clocking in a coordinated manner in order to reduce the energy consumption of datapath circuits. In dynamic frequency clocking, the functional units can be operated at different frequencies depending on the computations occurring within the datapath during a given clock cycle. The strategy is to schedule high-energy units, such as multipliers, at lower frequencies, so that they can be operated at lower voltages to reduce energy consumption, and low-energy units, such as adders, at higher frequencies to compensate for speed. The proposed time- and resource-constrained algorithms have been applied to various high-level synthesis benchmark circuits under different time and resource constraints. The experimental results show significant reductions in energy for both algorithms. Categories and Subject Descriptors: B.5.1 [Register-Transfer-Level Implementation]: Design—Data-path designs; B.5.2 [Register-Transfer-Level Implementation]: Design Aids—Automatic synthesis, optimization; B.7.1 [Integrated Circuits]: Types and Design Styles—VLSI (very large scale integration) General Terms: Algorithms, Performance, Design, Reliability Additional Key Words and Phrases: High-level synthesis, low-power datapath synthesis, multiple voltage scheduling, time-constrained scheduling, resource-constrained scheduling, dynamic frequency clocking

This research was carried out at the Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620. Authors’ addresses: S. P. Mohanty, Department of Computer Science and Engineering, University of North Texas, Denton, TX 76203; email: [email protected]; N. Ranganathan, Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620; email: [email protected]. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or [email protected].  C 2005 ACM 1084-4309/05/0400-0330 $5.00 ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005, Pages 330–353.


1. INTRODUCTION

With the increase in demand for personal computing devices and wireless communications equipment, the demand for synthesizing low-power circuits has increased. The need for low-power synthesis is driven by several factors [Pedram 1996]: the demand for portable systems (battery life), thermal considerations (cooling and packaging costs), environmental concerns (use of natural resources), and reliability issues. While the energy consumption of a device has to be minimized to increase battery life, the energy-delay-product has to be minimized to increase battery life and reduce delay simultaneously. Let us consider the following equations for a CMOS circuit [Burd and Brodersen 1995; Pouwelse et al. 2001b]:

— Energy dissipation per operation is

    E = C_eff V_dd^2,                                (1)

  where C_eff is the effective switched capacitance and V_dd is the supply voltage.

— For frequency f, the power dissipation for the operation is

    P = C_eff V_dd^2 f.                              (2)

— Further, the critical delay t_d in a device, which determines the maximum frequency f_max, is

    t_d = k V_dd / (V_dd − V_T)^α,                   (3)

  where V_T is the threshold voltage, α is a technology-dependent factor, and k is a constant.

From the above three equations, the following can be deduced [Burd and Brodersen 1995; Pouwelse et al. 2001b; Pering et al. 2000; Martin and Siewiorek 2001]:

— Reducing only V_dd saves both energy and power, at the cost of performance.
— Slowing down the circuit by reducing only f saves power but not energy.
— By scaling frequency and voltage in a coordinated manner, however, both energy and power can be saved while maintaining performance.

The third deduction forms the major motivation for this work. The objective is to generate a datapath schedule that attempts energy and power reduction without degrading performance, by using multiple voltages and dynamic frequency in a coordinated manner. In this article, we consider the use of dynamic frequency clocking, or frequency scaling, along with multiple supply voltages for the synthesis of low-power datapath circuits useful for signal processing applications. In dynamic frequency clocking (DFC), the functional units can operate at different speeds during each clock cycle, depending on the units active in that cycle. We develop two new datapath scheduling algorithms, one referred to as TC-DFC (time constrained) and the other as RC-DFC (resource constrained), both of which aim at reducing energy consumption. The resource constraints consist of the number and type of each functional unit, the allowed voltages, and the allowed frequencies. The time constraint is defined in terms of multiples of the critical path delay of the datapath circuit.
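The three deductions can be checked numerically from Equations (1)–(3). In the sketch below, the constants C_eff, k, V_T, and α are illustrative values, not parameters from the article; the point is only that scaling V_dd and f together reduces both E and P.

#include <math.h>
#include <stdio.h>

/* Equations (1)-(3) with illustrative constants. */
static const double Ceff = 1e-12;  /* effective switched capacitance (F) */
static const double k    = 1e-9;   /* technology constant                */
static const double VT   = 0.7;    /* threshold voltage (V)              */
static const double a    = 1.5;    /* technology-dependent exponent      */

static double energy(double Vdd)           { return Ceff * Vdd * Vdd; }            /* (1) */
static double power (double Vdd, double f) { return Ceff * Vdd * Vdd * f; }        /* (2) */
static double delay (double Vdd)           { return k * Vdd / pow(Vdd - VT, a); }  /* (3) */

int main(void)
{
    double V1 = 5.0, V2 = 3.3;
    double f1 = 1.0 / delay(V1);   /* max frequency at each voltage */
    double f2 = 1.0 / delay(V2);
    /* Scaling f and Vdd together: both E and P drop, while the clock
     * slows only as much as Equation (3) requires, illustrating the
     * third deduction.                                               */
    printf("E: %.3g -> %.3g J, P: %.3g -> %.3g W\n",
           energy(V1), energy(V2), power(V1, f1), power(V2, f2));
    return 0;
}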


RC-DFC minimizes the total energy consumption of the datapath circuit by maximizing the utilization of lower-supply-voltage resources from the given sets of resources operating at different supply voltages, while reducing the time penalty. TC-DFC, on the other hand, minimizes the total energy consumption of the datapath circuit without violating the timing constraint, assuming that unlimited resources operating at different supply voltages are available. The scheduler generates a parameter associated with each control step called the clock frequency index, denoted cfi_c for control step c. This parameter is provided to the dynamic clocking unit (DCU), which switches the clock based on it. The article is organized as follows. Section 2 describes the prior work in this area, and Section 3 describes the energy savings and performance improvement possible due to dynamic frequency clocking. Section 4 discusses the target architecture and the frequency selection scheme. Sections 5 and 6 present the time-constrained and the resource-constrained scheduling algorithms. Section 7 describes the experimental results, and Section 8 gives our conclusions.

2. RELATED WORK

We investigate the use of dynamic frequency clocking along with multiple supply voltages as a means to lower power consumption, and we develop high-level synthesis scheduling algorithms incorporating both multiple voltages and dynamic clocking. Our discussion of related work therefore covers two broad categories: first, works involving the use of frequency scaling in the design of general-purpose or multipurpose processor architectures, and then low-power datapath scheduling works.

Several approaches toward reducing power or energy consumption in both general-purpose and special-purpose processors have appeared in the literature. A dynamic voltage scaled microprocessor system was presented in Burd et al. [2000] and Pering et al. [2000], in which the frequency and voltage levels for the processor core are determined by the operating system. In Hsu et al. [2000], a power-efficient compiler determines the voltage level and clock frequency at compilation time from high-level code. Similar to the approach in Burd et al. [2000], the authors in Pouwelse et al. [2001a] described a system for a low-power microprocessor using dynamic voltage scaling. In Pouwelse et al. [2001b], an energy-prioritized scheduler that mediates between the application software and the operating system in determining the voltage and frequency levels for the CPU was described. In Grunwald et al. [2000], voltage and clock scheduling algorithms were incorporated within the operating system. In Martin and Siewiorek [2001], the authors described the system-level power-performance tradeoff for a variable-frequency processor system. In the above works, the suitable frequency and voltage at which the CPU core should run were determined either at the operating system level or at the compiler level. Thus, it is quite evident that simultaneous voltage and frequency scaling is becoming important in low-power processors. In the above


works, frequency scaling was explored at the CPU and system levels, while we explore the use of dynamic frequency clocking within the datapath, together with datapath scheduling algorithms that can be incorporated into a datapath synthesis tool.

Several low-power datapath scheduling techniques have been developed and reported in the literature. A scheduling algorithm using the "shutdown" technique, multiplexor reordering, and pipelining was described in Monteiro et al. [1996]. Scheduling and resource binding algorithms that reduce power by reducing the activity of functional units through minimizing operand transitions were described in Musoll and Cortadella [1995]. In Lin et al. [1997], an ILP formulation and a heuristic for variable-voltage scheduling were presented. An ILP-based datapath scheduling scheme using both multiple supply voltages and dynamic frequency clocking was described in Mohanty et al. [2003]. A scheduling algorithm called MOVER, based on an ILP formulation, was presented in Johnson and Roy [1997]. A dynamic programming technique for multiple-supply-voltage scheduling was discussed in Chang and Pedram [1997]. A time-constrained multiple-voltage scheduling technique was proposed in Sarrafzadeh and Raje [1999]. A resource-constrained scheduling algorithm with multiple supply voltages, which helps in reducing power, was given in Kumar and Bayoumi [1999]. In Shiue and Chakrabarti [2000], resource- and latency-constrained list-based scheduling algorithms with multiple supply voltages were discussed. Resource- and time-constrained scheduling based on the Lagrange multiplier method was addressed in Manzak and Chakrabarti [2002]. The above scheduling techniques were based on a single clock frequency and considered multiple supply voltages, voltage scaling, capacitance reduction, and switching activity reduction. In this work, we consider the use of dynamic frequency clocking, or frequency scaling, along with multiple supply voltages in developing resource- and time-constrained low-power datapath synthesis scheduling algorithms.

3. DYNAMIC FREQUENCY CLOCKING AND ENERGY SAVINGS

In this section, we briefly discuss the concept of dynamic frequency clocking. We also analyze, using a small example, the role of dynamic frequency clocking along with multiple supply voltages in reducing energy consumption while maintaining performance. In dynamic frequency clocking, the clock frequency is varied on-the-fly based on the functional units active in that cycle. In this clocking scheme, all the units are clocked by a single clock line which switches at runtime. The design and use of such a clocking mechanism has been explored in several works [Kim and Chae 1996; Ranganathan et al. 1998; Brynjolfson and Zilic 2000a, 2000b; Benini et al. 1998, 1999]. In Ranganathan et al. [1998], the dynamic frequency clocking mechanism was shown to improve the execution time as compared to a unifrequency global clock. In Benini et al. [1998, 1999], a concept similar to dynamic clocking, called the variable-latency telescopic unit, was used to synthesize high-performance systems. Figure 1 shows the unifrequency and dynamic frequency diagrams. The dynamic clocking unit (DCU) generates the


Fig. 1. (a) Single frequency; (b) dynamic frequency.

Fig. 2. Scheme for dynamic frequency generation.

required clock frequency utilizing a clock-divider strategy to generate frequencies that are submultiples of the base frequency. The base frequency f_base is the maximum frequency (or a multiple of the maximum frequency) of any functional unit (FU) at the maximum supply voltage. A value cfi_c, which comes from the controller, is loaded as an input to the DCU. The scheme for dynamic frequency generation is shown in Figure 2. The clock frequency is determined by dividing the base frequency by the cfi_c value, i.e., f_c = f_base / cfi_c.

As discussed in Section 1 with reference to Equations (1)–(3), frequency scaling helps in reducing power, but not energy [Pering et al. 2000; Burd and Brodersen 1995]. The frequency reduction, however, creates an opportunity to operate the different functional units at different voltages, which in turn helps in energy reduction. With the help of an example, we illustrate how dynamic frequency clocking, or frequency scaling, can be helpful in energy reduction while maintaining performance. Let us consider the example data flow graph (DFG) shown in Figure 3. Let t_m and t_a be the delays of the multiplier and the adder, respectively, at the maximum supply voltage V. The DFG has a schedule of three control steps. Let us consider three possible modes of operation: (i) single supply voltage and single frequency, (ii) multiple supply voltages and single frequency, and (iii) multiple supply voltages and dynamic frequency. It may be noted that, in case (ii), the energy overhead of the level converters has to be taken into account; similarly, the energy overheads of the level converters and of the DCU need to be considered in case (iii).

(i) Single supply voltage and single frequency. Each cycle has a clock width dictated by the slowest operator delay, t_m. The total energy consumption is given by E_S = 2E_m + 2E_a and the total delay is T_S = 3t_m.


Fig. 3. Example DFG.

(ii) Multiple supply voltages and single frequency. Let E_m^- and E_a^- be the energy consumption values for the multiplier and the adder, respectively, when operating at a lower voltage V^-. At this supply voltage, let t_m^+ and t_a^+ be the delays of the multiplier and the adder, respectively. It is evident that E_m^- and E_a^- are smaller than E_m and E_a, and that t_m^+ and t_a^+ are larger than t_m and t_a, respectively. In this case, the clock frequency is driven by the larger delay, the delay of the multiplier. Thus, the energy consumption of the DFG is given by E_M = E_m^- + E_a + E_m + E_a^-, and the total delay is T_M = 3t_m^+. Since E_M < E_S and T_M > T_S, the energy savings come at the cost of a time penalty.

(iii) Multiple supply voltages and dynamic frequency. In this case, the total energy consumption of the DFG is given by E_D = E_m^- + E_a + E_m + E_a^- (the same as E_M in case (ii)). Moreover, we now have a variable clock, so the delays for cycle 1, cycle 2, and cycle 3 are t_m^+, max(t_m, t_a^+), and t_a, respectively. Thus, the total delay is T_D = t_m^+ + max(t_m, t_a^+) + t_a. It is obvious that t_a < t_m^+ and max(t_m, t_a^+) < t_m^+, so T_D < T_M. Since E_D = E_M and T_D < T_M, we conclude that the time penalty has been reduced compared to case (ii) for the same energy reduction. Now, let us compare this with case (i). Since the delay of an adder is much less than the delay of a multiplier, without loss of generality we can assume that max(t_m, t_a^+) ≈ t_m, and also that t_m^+ + t_a ≈ 2t_m (as t_m^+ > t_m and t_a < t_m); it is thus possible that T_D ≈ 3t_m = T_S. Hence, we have E_D < E_S and T_D ≈ T_S; in other words, energy reduction is achieved without degrading performance.
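The comparison of the three modes can be exercised with toy numbers. The energy and delay values in the sketch below are invented for illustration (they are not from the article); with them, the computed quantities satisfy E_D < E_S with T_D = T_S, as argued above.

#include <stdio.h>

int main(void)
{
    /* Illustrative per-operation values (not from the article):
     * at voltage V:  multiplier Em, tm; adder Ea, ta
     * at voltage V-: Em_lo, Ea_lo with stretched delays tm_hi, ta_hi */
    double Em = 100, Ea = 10, tm = 40, ta = 10;            /* pJ, ns */
    double Em_lo = 55, Ea_lo = 6, tm_hi = 70, ta_hi = 25;

    /* (i) single voltage, single frequency */
    double ES = 2*Em + 2*Ea, TS = 3*tm;
    /* (ii) multiple voltages, single (slow) clock */
    double EM = Em_lo + Ea + Em + Ea_lo, TM = 3*tm_hi;
    /* (iii) multiple voltages, dynamic clock */
    double ED = EM;
    double TD = tm_hi + (tm > ta_hi ? tm : ta_hi) + ta;
    printf("ES=%g TS=%g | EM=%g TM=%g | ED=%g TD=%g\n",
           ES, TS, EM, TM, ED, TD);   /* here ED < ES and TD == TS */
    return 0;
}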

4. TARGET ARCHITECTURE AND DATAPATH SPECIFICATIONS

The target architecture model assumed for the scheduling schemes is shown in Figure 4. Each functional unit feeds one register and also has a multiplexor; the register and the multiplexor operate at the same voltage level as the functional unit. Level converters are used when a low-voltage functional unit drives a high-voltage functional unit [Johnson and Roy 1997; Shiue and Chakrabarti 2000]. A controller decides which functional units are active in each control step; the inactive ones are disabled using the multiplexors. The controller has a storage unit to store the parameters cfi_c obtained from the scheduling. The cycle frequency f_c is generated dynamically using the DCU, and a


Fig. 4. Level converters needed for stepping up signal.

functional unit operating at one of the supply voltages is activated. It may be noted that while the level converters are many and internal to the datapath circuit, the DCU is a single unit external to it; both are considered overheads of the multiple-voltage, dynamic-clocking-based design.

The datapath is specified as a sequencing data flow graph (DFG) [Micheli 1994]. Each vertex of the DFG represents an operation and each edge represents a dependency. In this work, we consider signal processing applications, for which the dynamic frequency clocking scheme is useful for energy reduction, so we assume that the datapath circuit is represented as a directed acyclic DFG. The DFG does not support hierarchical entities, and conditional statements are handled using comparison operations. Each vertex has attributes that specify the operation type. The delay of a control step depends on the delays of the functional unit and of the multiplexor-register pair. Let d_reg be the delay of the register, d_mux the delay of the multiplexor, d_fu the delay of the functional unit, and d_level the delay of the level converter. The worst-case operational delay of a functional unit can be written as

    d_FU = d_reg + d_mux + d_fu + d_level.           (4)

The register delays include the set-up and propagation delays. The delay of a control step, d_c, is the delay of the slowest functional unit in control step c. Using the above delay model, the worst-case delays of the library components are estimated. For a given base frequency f_base, the maximum frequencies of each FU are scaled down to the operating frequencies f_base / cfi_c, where cfi_c = 1, 2, ..., any positive integer. In general, the value of cfi_c is bounded by the total number of resources raised to the power of the number of frequency levels. Assuming two resources, a multiplier (MULT) and an arithmetic logic unit (ALU), and three frequency levels, the possible frequencies are ALU_High (cfi_c = 1), ALU_Med (cfi_c = 2), ALU_Low (cfi_c = 4), MULT_High (cfi_c = 2), MULT_Med (cfi_c = 4), and MULT_Low (cfi_c = 8). For example, if the base frequency fed to the DCU is 36 MHz, then the frequencies generated are 18 MHz, 9 MHz, and 4.5 MHz. The clock frequency for a given control step is the minimum of the operating frequencies of all FUs active in that step.

5. TIME-CONSTRAINED SCHEDULING

The objective is to minimize the energy consumption without violating the timing constraint, while the resources operate at different supply voltages and are available in unlimited numbers. The inputs to the algorithm are an


Fig. 5. TC-DFC Scheduling algorithm flow.

unscheduled DFG, the scaled-down operating frequencies, and the execution time constraint T_c for the whole schedule. The outputs produced by the algorithm are the scheduled DFG, the voltage assignment for each node, the cycle frequency indices, and energy estimates. To obtain greater energy savings while maintaining performance, the multipliers should be operated at as low a frequency as possible and the adders at as high a frequency as possible. This objective can be achieved if adders or subtractors are not operated alongside multipliers in the same clock cycle. In the cases where they must operate during the same cycle to meet the time constraint, the energy savings come from the multipliers only. Initially, TC-DFC generates a schedule such that the low-frequency operators are scheduled at earlier steps and the high-frequency operators at later steps. TC-DFC then modifies the schedule by moving operations from one step to another with the objective of meeting the time constraint. Finally, it finds the appropriate clock cycle width and assigns the appropriate voltage.

5.1 TC-DFC Algorithm

The TC-DFC scheduling algorithm is outlined in Figure 5. In Step 1, an as-soon-as-possible (ASAP) schedule for the given unscheduled data flow graph is determined. In Step 2, the scheduler creates a priority list of the vertices in which higher priority is given to the vertices that are to be scheduled at earlier control steps. This priority approach ensures the scheduling of energy-hungry resources at earlier control steps and of other resources at later control steps, and avoids their concurrent operation. The priority list is created as follows: all multiplications (i.e., low-frequency operations) are grouped with higher priority than the ALU operations (i.e., high-frequency operations such as additions and subtractions). Among the multiplication operations, higher priority is given to the operations with a smaller ASAP time stamp; similarly, among the ALU operations, higher priority is assigned to the operations with a smaller ASAP time stamp. In Step 3, the vertices are time stamped in an ASAP

Table I. TC-DFC Frequency Selection: From Left → Right in Each Step

             MULT_Low   MULT_Med/ALU_Low   MULT_High/ALU_Med   ALU_High
Frequency    4.5 MHz    9 MHz              18 MHz              36 MHz
cfi_c        8          4                  2                   1

manner using the vertex priority list, such that no multiplication and ALU operations are scheduled to execute concurrently. Moreover, it is ensured that operation precedence is satisfied and that higher-priority vertices are scheduled at earlier time stamps. In Step 4, for the current schedule, the cycles are categorized as cycles having only ALU operations, only multiplications, or both ALU operations and multiplications (mixed operations). It may be noted that the aim of the scheduling is to avoid, as far as possible, the concurrent scheduling of ALU operations (which use low-energy resources) and multiplication operations (which use high-energy resources); this may not be possible when the time constraint is very strict. In Step 5, a priority list of clock cycles is created such that the higher-priority cycles are the preferred candidates for higher-frequency assignment. The cycle priority list is created as follows. The cycles with only ALU operations get higher priority than the cycles with only multiplications and the cycles with mixed operations; the cycles with only multiplications get higher priority than the cycles with mixed operations. Further, among the cycles with only ALU operations, higher priority is given to the cycles having fewer ALU operations; similarly, among the cycles with only multiplications, higher priority is given to the cycles having fewer multiplications. Among the cycles with mixed operations, higher priority is given to the cycles having fewer multiplications. In Step 6, the initial cycle frequency is assigned as the left-most operating frequency from Table I. It should be noted that Table I shows two types of resources, multipliers and ALUs, and three frequency levels for each; it can be extended in a similar manner to accommodate more resource types and frequency levels. In Step 7, in order to fulfill the time constraint, the frequency of the highest-priority cycle is increased using Table I. If needed, the process is repeated for the next higher-priority cycle; the repetition is necessary when the maximum frequency assignment for the highest-priority cycle does not satisfy the time constraint. In Step 8, if it is found that all cycles with multiplications (low-priority cycles) must operate at the highest frequency to satisfy the time constraint, then the cycle having the minimum number of ALU operations is eliminated and the schedule is adjusted. This is necessary because, if the multipliers (energy-hungry resources) operate at the highest frequency, there will be no energy reduction. The adjustment involves reducing the time stamps of the vertices scheduled in the cycle to be eliminated and of their successors and predecessors. In Step 9, the voltage assignment is done and the energy estimate for the entire DFG is computed; at this step, the minimum allowable voltage that meets the cycle frequency is assigned. In Step 10, the cycle frequency index for each cycle is determined based on Table I.
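A minimal sketch of the Step 6 and Step 7 logic is given below. It walks the Table I frequency ladder (4.5/9/18/36 MHz, cfi_c = 8/4/2/1), starting every cycle at the left-most (slowest) entry and speeding up cycles in priority order until the time constraint T_c is met. The per-cycle frequency caps implied by which FUs occupy a cycle, and the Step 8 cycle elimination, are omitted; all identifiers are invented for illustration.

/* Table I ladder: index 0 is the left-most (slowest) frequency. */
static const double freq_MHz[] = { 4.5, 9.0, 18.0, 36.0 };
static const int    cfi[]      = { 8,   4,   2,    1    };
#define NLEVELS 4

/* level[c]: current ladder index of cycle c. */
static double schedule_delay(const int *level, int ncycles)
{
    double t = 0;
    for (int c = 0; c < ncycles; c++)
        t += 1000.0 / freq_MHz[level[c]];   /* cycle period in ns */
    return t;
}

/* Step 7: raise the frequencies of high-priority cycles (prio[] is
 * the Step 5 ordering) until the time constraint Tc (ns) is met;
 * returns 0 if Tc cannot be met with this ladder.                  */
static int meet_time_constraint(int *level, const int *prio,
                                int ncycles, double Tc)
{
    for (int p = 0; p < ncycles; p++)
        while (schedule_delay(level, ncycles) > Tc &&
               level[prio[p]] < NLEVELS - 1)
            level[prio[p]]++;               /* next faster frequency */
    return schedule_delay(level, ncycles) <= Tc;
}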


Fig. 6. Pseudocode for TC-DFC scheduling algorithm.

A detailed representation of the above algorithm in the form of pseudocode is given in Figure 6. The list of functions needed in the implementation of the algorithm is given in Table II; similarly, the data structures and identifiers used in the algorithm description are summarized in Table III. We now explain the working of the above algorithm using the HAL benchmark DFG. We start with the unscheduled HAL DFG from Micheli [1994]. In Step 1, the ASAP time stamps are assigned and the ASAP-scheduled DFG shown in Figure 7 is obtained. The vertex priority list for this DFG, created in Step 2, is given in Table IV. Then, in Step 3, using this vertex priority list, another schedule is obtained (the DFG of Figure 8(a) without voltage or cfi_c assignment), in which no multiplier and ALU operations are scheduled concurrently. For this DFG, the cycle priority list shown in Table V is obtained using Step 4. Using this cycle priority list, frequency assignments are done in Steps 6 and 7 to meet the time constraint, say T_c ≈ 2 ∗ T_cp. Finally, Steps 9 and 10 perform the voltage assignment and the cycle frequency index calculation, respectively, and the final DFG shown in Figure 8(a) is obtained.

Table II. List of Functions Used in the TC-DFC Algorithm

Function                     Description                                                     Complexity
ASAPScheduler                Determines the ASAP time stamp of the vertices.                 Θ(|V| + |E|)
CreateVertexPriorityList     Creates a priority list of vertices such that the vertex        Θ(|V|)
                             with the lower operating frequency gets the higher priority.
TOP                          Finds the first vertex from the priority list array.            Θ(1)
CheckFrequencyConstraint     Checks the frequency constraint in a control step.              Θ(1)
Max                          Finds the maximum value from an array.                          Θ(c)
CreateCyclePriorityList      Constructs the cycle priority list in an array.                 Θ(c)
InitializeFrequency          Initializes the operating frequency of each cycle.              Θ(L_f)
CalculateDelay               Calculates the critical path delay using CycleFrequencyList.    Θ(c)
FindNextFrequency            Finds the next available frequency.                             O(L_f)
FindCycleWithMinimumALU      Finds the control step with the minimum number of ALU ops.      Θ(c R_T)
Adjust Predecessor           Adjusts the time stamp of a predecessor.                        O(|V|)
Adjust Successor             Adjusts the time stamp of a successor.                          O(|V|)
Voltage Assignment           Assigns a voltage to each vertex.                               Θ(|V|)
Find Cycle Frequency Index   Finds the cycle frequency indices of all cycles.                Θ(c)

Table III. List of Variables and Data Structures Used in the TC-DFC Algorithm

Data Structure        Description
ASAPSchedule          An array used to store the ASAP time stamp of each vertex
TC-DFCSchedStep       An array used to store the TC-DFC time stamp of each vertex
ScheduledVertexList   An array used to store the vertices already scheduled
VertexPriorityList    An array used to store the vertices in priority order
CyclePriorityList     An array used to store the control steps in priority order
TC-DFCNoOfSteps       Total number of control steps of the TC-DFC schedule
CycleFrequencyList    An array used to store the frequency of each cycle
cycle                 A temporary variable
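Tables II and III suggest a direct C layout for the scheduler state; the struct below is one possible rendering, with field names following Table III and array bounds chosen arbitrarily for illustration.

#define MAX_V 64   /* max vertices (illustrative bound)      */
#define MAX_C 32   /* max control steps (illustrative bound) */

/* TC-DFC scheduler state per Table III. */
typedef struct {
    int    ASAPSchedule[MAX_V];        /* ASAP time stamp per vertex      */
    int    TCDFCSchedStep[MAX_V];      /* TC-DFC time stamp per vertex    */
    int    ScheduledVertexList[MAX_V]; /* vertices already scheduled      */
    int    VertexPriorityList[MAX_V];  /* vertices in priority order      */
    int    CyclePriorityList[MAX_C];   /* control steps in priority order */
    int    TCDFCNoOfSteps;             /* control steps in final schedule */
    double CycleFrequencyList[MAX_C];  /* frequency of each cycle (MHz)   */
} TCDFCState;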

Similarly, the time constraint T_c ≈ 1.75 ∗ T_cp can be met using the same cycle priority list (Table V), with a higher frequency assigned to the next-priority cycle. Now suppose we want a schedule with time constraint T_c ≈ 1.5 ∗ T_cp. Using the cycle priority list (Table V), Step 7 attempts to assign frequencies, and it is found that the cycles in which multipliers are scheduled would have to be at the highest operating frequency to meet such a constraint. So, using Step 8, cycle 5 is eliminated from the DFG of Figure 8(a) (without the voltage or cfi_c assignment obtained in Step 3 earlier). The new DFG is that shown in Figure 8(c) without voltage or cfi_c assignment. For this DFG, we obtain the cycle priority list shown in Table VI. Using Steps 6, 7, 9, and 10 as above, we obtain the final scheduled DFG shown in Figure 8(c).

5.2 TC-DFC Time Complexity

Let there be |V| vertices and |E| edges in the DFG, and suppose the number of control steps found from the ASAP scheduling is c. Let L_f denote the number of frequency levels and R_T the number of resource types. Based on the time complexity of the different functions given


Fig. 7. HAL differential equation solver benchmark DFG with ASAP labels (Step 1).

Table IV. Vertex Priority List for HAL DFG (Step 2)

Vertex     v0  v1  v2  v6  v8  v3  v7  v10  v9  v11  v4  v5  v12
Priority   0   1   2   3   4   5   6   7    8   9    10  11  12

in Table II, we provide the following analysis of the worst-case running time of the TC-DFC algorithm. The time taken by the instructions in lines 01–02 is Θ(|V| + |E|) + Θ(|V|). The running time of the code segment in lines 03–09 is Θ(c|V|). Similarly, Θ(c) + Θ(L_f) is the running time of the code segment in lines 10–13. The while loops in lines 14 and 16 terminate when the time constraint is satisfied, which involves a search in the frequency selection table; the number of times these while loops execute is therefore independent of the input size |V| or |E|. Thus, the time complexity of the code segment in lines 14–31 is Θ(c R_T) + Θ(|V|) + Θ(L_f) + Θ(c) + Θ(c) + Θ(L_f), which is the same (from an algorithmic complexity point of view) as Θ(c R_T) + Θ(|V|) + Θ(L_f) + Θ(c). Without loss of generality, we can assume that R_T, L_f, and c are upper bounded by the number of vertices |V|. Using this assumption, the overall running time of the algorithm is Θ(|V| + |E|) + Θ(|V|·|V|). For strong data dependency we have |E| ≈ |V|^2, and for weak data dependency |E| ≪ |V|^2. In either case, the simplified time complexity of the TC-DFC scheduling algorithm is Θ(|V|^2); in other words, the time complexity is polynomial in the number of vertices of the data flow graph.

6. RESOURCE-CONSTRAINED SCHEDULING

In the resource-constrained algorithm, the objective is to minimize energy consumption by maximizing the utilization of low-supply-voltage resources from a given set of resources operating at different supply voltages, while reducing the time penalty as much as possible. The combined reduction of energy


Fig. 8. Schedules obtained for HAL benchmark for different time constraints using TC-DFC.

Table V. Cycle Priority List for HAL DFG: T_c ≈ 2 ∗ T_cp or 1.75 ∗ T_cp (Step 4)

Cycles     c5  c4  c3  c2  c1  c6  c0
Priority   0   1   2   3   4   5   6

consumption and time penalty translates to a reduction of the energy-delay-product. Thus, the objective of RC-DFC is to minimize the energy-delay-product while assigning a schedule for the DFG. For a resource i operating in clock step c, let (i) α_i,c be the switching activity, (ii) C_i,c be the load capacitance, and (iii) V_i,c be the operating voltage. If a level converter is needed, it is considered a resource needed in the particular clock cycle in which it steps up the signal. If N



Energy-Efficient Datapath Scheduling

343

Table VI. Cycle Priority List for HAL DFG: T_c ≈ 1.5 ∗ T_cp (Step 4 after Step 8)

Cycles     c4  c3  c2  c1  c5  c0
Priority   0   1   2   3   4   5

Table VII. Frequency Selection: From Left to Right in Each Step

FUs in a Cycle   Frequency Priority Order
MULT             MULT_Low, MULT_Med, MULT_High
MULT + ALU       MULT_Low, ALU_Low, MULT_High
ALU              ALU_High, ALU_Med, ALU_Low

Table VIII. Resource Lookup Table (Order, from Left to Right)

                        MULT                    ALU
Clock Cycle c    2.4 V   3.3 V   5.0 V   5.0 V   3.3 V   2.4 V
                 1       2       1       1       1       0

is the total number of clock cycles for the DFG, NR_c is the number of resources active in cycle c, and f_c is the cycle frequency, then the total energy consumption of the DFG, E_D, and the energy-delay-product, EDP_D, are characterized by Equation (5). The inputs to the algorithm are an unscheduled DFG and the resource constraints, which include the number of resources, their corresponding operating voltages, and the scaled-down operating frequencies. The algorithm generates various outputs: the scheduled DFG with node voltages assigned, the cycle frequency indices, energy and delay estimates, and energy-delay-product estimates.

    E_D = Σ_{c=1..N} Σ_{i=1..NR_c} α_i,c C_i,c V_i,c^2,

    EDP_D = E_D ∗ T_D = ( Σ_{c=1..N} Σ_{i=1..NR_c} α_i,c C_i,c V_i,c^2 ) ( Σ_{c=1..N} 1/f_c ).     (5)
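Equation (5) maps directly onto two nested loops. In the sketch below, the arrays alpha, Cload, and V, indexed by cycle and by active-resource slot (with an arbitrary cap of 16 resources per cycle), are hypothetical stand-ins for the scheduler's internal data structures.

/* E_D and EDP_D per Equation (5).
 * NR[c]: resources active in cycle c (assumed <= 16 here);
 * f[c]: cycle frequency (Hz); alpha/Cload/V: switching activity,
 * load capacitance, and voltage of the i-th active resource.     */
static void edp(int N, const int *NR, const double *f,
                double alpha[][16], double Cload[][16], double V[][16],
                double *E_D, double *EDP_D)
{
    double energy = 0.0, T_D = 0.0;
    for (int c = 0; c < N; c++) {
        for (int i = 0; i < NR[c]; i++)
            energy += alpha[c][i] * Cload[c][i] * V[c][i] * V[c][i];
        T_D += 1.0 / f[c];          /* schedule delay: sum of cycle periods */
    }
    *E_D = energy;
    *EDP_D = energy * T_D;
}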

RC-DFC attempts to operate the multipliers at as low a frequency as possible; the resulting decrease in performance is compensated by operating the ALUs at as high a frequency as possible. Depending on which functional units are active in a given cycle, the algorithm determines the frequency using a LUT, called the frequency selection LUT (such as the one shown in Table VII), scanning it from left to right. In a schedule, if only multipliers are needed in a particular cycle, the frequency selection is in the order MULT_Low, MULT_Med, MULT_High. If both multipliers and ALUs are operating in a given clock cycle, the frequency selection is in the order MULT_Low, ALU_Low, MULT_High. If only ALUs are operating in a control step, the frequency selection is in the order ALU_High, ALU_Med, ALU_Low. Another lookup table, called the resource assignment LUT, is constructed from the resource constraints; it is used to match the selected frequency with a corresponding voltage level. The resources are assigned by scanning this LUT from left to right. Moreover, the scheduling algorithm uses heuristics to minimize the number of level conversions needed. An example of a resource assignment LUT is shown in Table VIII with the following


Fig. 9. RC-DFC scheduling algorithm flow.

resource constraints: one MULT at 2.4 V, two MULTs at 3.3 V, one MULT at 5.0 V, one ALU at 3.3 V, and one ALU at 5.0 V. The MULTs are arranged in order from low to high voltage, whereas the ALUs are arranged from high to low. The LUT is updated during each assignment to make sure that the resource constraints are not violated. The dimensions of the LUTs in Table VII and Table VIII depend on the total number of clock cycles of the schedule and/or the number of resource types. The tables can be extended to accommodate more types of resources and different voltage and frequency levels; it has to be ensured that the energy-hungry resources are arranged in order from low to high voltage, whereas the lower-energy resources are arranged from high to low.

6.1 RC-DFC Algorithm

Figure 9 shows the flow of the proposed algorithm. In Step 1, the scheduler determines the ASAP and the ALAP schedules for the UDFG; in this step, the time constraint for the ALAP schedule is obtained from the ASAP schedule. In Step 2, the total number of each resource type is found as the sum of the resources at the different voltage levels; for example, if the resource constraint is 2 ALUs at 2.4 V, 1 ALU at 3.3 V, 1 multiplier at 2.4 V, and 3 multipliers at 5.0 V, then the number of ALUs is three and the number of multipliers is four. In Step 3, the ASAP and ALAP schedules of Step 1 are modified using the resource counts found in Step 2 so that the resource constraints are not violated; in this process, the mobility of the vertices is restricted to a great extent and the search space for the following steps is reduced. In Step 4, the total numbers of control steps of the ASAP and ALAP schedules are found, and the number of control steps of the final schedule is taken as the maximum of the two. This step is necessary because the total number of clock cycles may differ from that of the original ASAP or ALAP schedule, and the two may not remain equal while stringent resource constraints are being satisfied. In Step 5, the resource assignment LUT and the frequency selection LUT are constructed. The resource assignment LUT is constructed (similar to Table VIII) with a size that depends on the number of control steps, the number of


Fig. 10. Pseudocode for RC-DFC scheduler.

resource types, and the number of voltage levels. In Step 6, the vertices having nonzero mobility (different ASAP and ALAP time stamps) and the vertices with zero mobility (the same ASAP and ALAP time stamps) are identified, and the current schedule is initialized to the ASAP schedule obtained in Step 3. In Step 7, voltage and frequency assignments are made for the current schedule using the LUTs. This step returns two lists: one containing the assigned voltage of each vertex and the other containing the selected frequency of each cycle. In Step 8, the scheduler finds a proper step for each vertex having nonzero mobility such that the energy-delay-product of the whole DFG is minimized. In Step 9, the current schedule and the resource assignment LUT are adjusted to satisfy precedence. In Step 10, the cycle frequency indices are found for all cycles; these are stored in the controller and fed to the DCU for clock generation. The algorithm terminates once all nonzero-mobility vertices are scheduled. The pseudocode for the algorithm is shown in Figure 10. The list of functions needed in the implementation of the algorithm is given in Table IX; similarly, the data structures and identifiers used in the algorithm description are summarized in Table X. It should be noted that the algorithm can easily be extended to handle more than two types of resources; in such a scenario, the dimensions of the resource assignment LUT and the frequency selection LUT change accordingly.
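A minimal model of the voltage lookup against the resource assignment LUT (Step 7) is sketched below, assuming the Table VIII convention: multiplier slots ordered from low to high voltage and ALU slots from high to low. The encodings are invented, and the real scheduler's frequency selection LUT and level-converter heuristics are omitted.

#define NLEV 3            /* voltage levels per resource type        */
enum { MULT, ALU };       /* resource types (illustrative encoding)  */

/* lut[c][type][k]: units of 'type' still free in cycle c at scan
 * position k.  For MULT, k = 0..2 means 2.4 V, 3.3 V, 5.0 V
 * (low-to-high); for ALU it means 5.0 V, 3.3 V, 2.4 V (high-to-low),
 * as in Table VIII.                                                  */
static int assign_resource(int lut[][2][NLEV], int cycle, int type)
{
    for (int k = 0; k < NLEV; k++)
        if (lut[cycle][type][k] > 0) {
            lut[cycle][type][k]--;   /* left-to-right scan, then update */
            return k;                /* voltage slot chosen             */
        }
    return -1;                       /* constraint violated in this cycle */
}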

Table IX. List of Functions Used in the RC-DFC Algorithm

Function                      Description                                                   Complexity
ASAPScheduler                 Determines the ASAP time stamp of the vertices.               Θ(|V| + |E|)
ALAPScheduler                 Determines the ALAP time stamp of the vertices.               Θ(|V| + |E|)
ModifySchedule                Modifies the unconstrained schedules to incorporate           Θ(|V| + |E|)
                              voltage-relaxed resource constraints.
ConstructResAssignmentTable   Constructs the resource assignment LUT.                       Θ(c L_V R_T)
Max                           Finds the maximum of the control steps.                       Θ(1)
FindResTypeForEachVertex      Identifies the FU needed for the operation at each vertex.    Θ(|V|)
ConstructFreqSelectionLUT     Constructs the frequency selection LUT.                       Θ(L_f)
FindMobileVertexList          Finds the mobility of each vertex.                            Θ(|V|)
AllocateVoltAndFreq           Allocates the voltage and frequency levels using the LUTs     Θ(c|V| L_V R_T)
                              and the current schedule.
CalculateEDP                  Calculates the energy-delay-product of the whole DFG.         Θ(|V|)
AdjustSchedule                Adjusts predecessor and successor time stamps such that       O(|V|)
                              precedence is satisfied for a particular vertex.
Update Resource Assign. LUT   Updates the resource assignment LUT.                          Θ(1)
FindEnergyAndDelay            Determines the energy consumption and delay.                  Θ(|V|)
FindCycleFreqIndex            Finds the cycle frequency indices of all cycles.              Θ(c)

Table X. List of Variables and Data Structures Used in the RC-DFC Algorithm

Data Structure                   Description
ASAPSchedule                     An array used to store the ASAP time stamp of each vertex
ALAPSchedule                     An array used to store the ALAP time stamp of each vertex
CurrentSchedule                  An array used to store the current schedule time stamp
TempSchedule                     An array used to store a temporary schedule time stamp
MULT                             Number of multipliers over all voltage levels
ALU                              Number of ALUs over all voltage levels
ASAPControlSteps                 Total number of control steps of the ASAP schedule
ALAPControlSteps                 Total number of control steps of the ALAP schedule
NoOfControlSteps                 Number of control steps of the schedule
ResAssignmentLUT                 Resource assignment lookup table
FreqSelectionLUT                 Frequency selection lookup table
max, start, end, cycle           Temporary variables
CurrentEDP, TempEDP, ExtraEDP    Temporary variables
CurrentVertex, CurrentCycle      Temporary variables
VoltageArray                     An array used to store the operating voltage of each vertex
FrequencyArray                   An array used to store the operating frequency of each cycle
ZeroMobilityVertexList           An array storing the vertices with zero mobility
NonZeroMobilityVertexList        An array storing the vertices with nonzero mobility

Moreover, in such an extension, the multiplier is replaced by the highest energy-consuming resource and the ALU by the lowest energy-consuming resource; the others fall in between. A final scheduled DFG obtained using this algorithm is shown in Figure 11 for the resource constraint (one MULT at 2.4 V, one MULT at 3.3 V, one ALU at 3.3 V, and one ALU at 5.0 V).


Fig. 11. Final schedule of FIR filter DFG (using RC-DFC).

6.2 RC-DFC Time Complexity

Let there be |V| vertices and |E| edges in the DFG, out of which |V_m| vertices have mobility, and let the maximum mobility of any mobile vertex be t_m. Let L_V denote the number of voltage levels and L_f the number of frequency levels, and suppose the number of control steps found from the ASAP scheduling is c. Assuming that L_V and L_f are upper bounded by |V|, the running time of the code segment in lines 01–07 is Θ(|V| + |E|) + Θ(c L_V R_T). The time complexity of the instructions in lines 11–19 is Θ(c|V| L_V R_T |V_m| t_m). The code segment in lines 09–19 has running time Θ(c|V| L_V R_T |V_m| t_m) + Θ(|V|) + Θ(c|V| L_V R_T) = Θ(c|V| L_V R_T |V_m| t_m). The running time of the code segment in lines 08–19 is Θ(c|V| L_V R_T |V_m|^2 t_m). The time complexity of lines 20–25 is Θ(|V|) + Θ(c|V| L_V R_T) + Θ(c) = Θ(c|V| L_V R_T).

Table XI. Resource Constraints Used for Performing our Experiments

Serial No.   MULT 3.3 V   MULT 5.0 V   ALU 3.3 V   ALU 5.0 V
1            2            1            1           1
2            3            0            1           1
3            2            0            0           2
4            1            1            0           2

So the running time of the overall algorithm is Θ(|V| + |E|) + Θ(c Lv RT) + Θ(c|V| Lv RT |Vm|^2 tm) + Θ(c|V| Lv RT) = Θ(|V| + |E|) + Θ(c|V| Lv RT |Vm|^2 tm). Assuming that |E| is upper bounded by |V|^2 and |Vm| is upper bounded by |V|, the above expression can be simplified to O(c|V|^3 Lv RT tm).

7. EXPERIMENTAL RESULTS

Both the RC-DFC and TC-DFC schedulers were implemented in C and tested with selected benchmark circuits. The benchmarks used were as follows:

(1) Auto-regressive (ARF) filter [Antola et al. 1998],
(2) Band-pass filter (BPF) [Papachristou and Konuk 1990],
(3) Elliptic-wave filter (EWF) [Kollig and Al-Hashimi 1997],
(4) DCT [Fetweis et al. 1993],
(5) FIR filter [Kumar and Bayoumi 1999], and
(6) HAL differential equation solver [Micheli 1994].

The FUs used were ALUs and multipliers. The energy values were computed using the datapath components given in Mohanty and Ranganathan [2003] and Mohanty et al. [2002]. The following notations are used to express the results:

(i) E_S and E_D are the total energy consumption (in picojoules) for single-supply voltage and multiple-supply voltage operations, respectively.
(ii) EDP_S and EDP_D are the energy-delay-products (in 10^-18 Js) for single-supply voltage with single-frequency operation and for multiple-supply voltage with dynamic clocking operation, respectively.
(iii) T_S and T_D are the corresponding delays (in nanoseconds) for the two modes of operation.
(iv) N_S denotes the number of clock steps of the schedule for single-supply voltage and single-frequency operation.
(v) N_D is the equivalent number of clock steps of T_D, taking the delay of the slowest functional unit as the base clock width in the case of multiple-voltage operation.

The percentage energy savings is calculated as S_E = ((E_S - E_D) / E_S) * 100. In a similar manner, we calculated the percentage reduction in EDP, which is denoted as S_EDP.

For the RC-DFC scheduler, the experimental setup was as follows. The algorithm was tested using the different sets of resource constraints listed in Table XI. The experimental results for various benchmark circuits are reported in Table XII.
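As a quick check of these definitions against the first ARF row of Table XII, a minimal C sketch (the numbers are taken directly from the table):

#include <stdio.h>

/* Percentage savings as defined above: (base - new) / base * 100. */
static double savings(double base, double val)
{
    return (base - val) / base * 100.0;
}

int main(void)
{
    double ES = 36168.0, ED = 21768.0;      /* energies in pJ (ARF, constraint 1) */
    double EDPS = 20093.0, EDPD = 19954.0;  /* EDPs in 1e-18 Js (same row)        */
    printf("S_E   = %.0f%%\n", savings(ES, ED));      /* prints 40, matching Table XII */
    printf("S_EDP = %.0f%%\n", savings(EDPS, EDPD));  /* prints 1 (reduction vs EDP_S) */
    return 0;
}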



Table XII. Energy or EDP Estimates for Different Benchmarks Using RC-DFC Scheduler
(Energy in pJ; EDP in 10^-18 Js; T_S and T_D in ns. The two S_EDP columns give the reduction of EDP_D relative to EDP_M and to EDP_S, respectively.)

Bench  RC   E_S     E_D     S_E   EDP_S   EDP_M   EDP_D   S_EDP(M)  S_EDP(S)  N_S   T_S    T_D    N_D
ARF    1    36168   21768   40    20093   24186   19954   17        1         10    556    917    9
ARF    2    36168   18205   50    20093   20227   16688   17        17        10    556    917    9
ARF    3    36168   19065   47    20093   21183   18006   15        10        10    556    944    9
ARF    4    36168   27617   24    26121   44877   31452   29        NA        13    722    1139   10
BPF    1    27654   16491   40    13827   16490   14659   11        NA        9     500    889    8
BPF    2    27654   14175   49    13827   14174   12600   11        9         9     500    889    8
BPF    3    27654   14827   46    13827   14827   12356   16        11        9     500    833    8
BPF    4    27654   20172   27    26118   42864   23253   45        11        17    944    1153   10
EWF    1    19404   10802   44    17248   19203   12902   32        25        16    889    1194   11
EWF    2    19404   10802   44    17248   19203   12902   32        25        16    889    1194   11
EWF    3    19404   10853   44    17248   19293   11154   42        35        16    889    1028   10
EWF    4    19404   11922   39    29106   40235   17055   57        41        27    1500   1431   12
DCT    1    30675   17846   42    25547   29743   26274   11        NA        15    833    1472   14
DCT    2    30675   17846   42    25547   29743   26274   11        NA        15    833    1472   14
DCT    3    30675   18008   41    25548   30013   25511   14        0         15    833    1416   13
DCT    4    30675   18008   41    49392   65278   37267   42        25        29    1611   2069   17
FIR    1    18678   9979    47    11414   12196   6653    45        42        11    611    667    7
FIR    2    18678   9979    47    11414   12196   6653    45        42        11    611    667    7
FIR    3    18678   10126   45    11414   12377   6470    47        43        11    611    639    6
FIR    4    18678   10127   46    15565   18987   12096   36        22        15    833    1194   10
HAL    1    13596   8927    34    3021    3967    2728    31        10        4     222    306    3
HAL    2    13596   6433    53    3021    2859    1966    31        35        4     222    306    3
HAL    3    13596   6648    51    3021    2954    2401    18        21        4     222    361    4
HAL    4    13596   10211   25    3777    6382    4396    31        NA        5     278    431    4

The energy estimation includes the energy consumption of the overhead units. It is assumed that each resource has equal switching activity. The results are reported for two supply voltages and for a switching activity of 0.5. The energy consumption increased for higher switching activity and decreased for lower switching activity but, under the assumption that the switching activity was the same for each resource, the percentage energy savings was not affected. There were very few resource constraints for which there was no reduction in the energy-delay-product for some benchmarks (reported as NA in Table XII). The reduction in the energy-delay-product S_EDP is shown in two columns: the first represents the reduction of EDP_D with respect to EDP_M, and the other shows the reduction of EDP_D with respect to EDP_S. In all benchmarks, and for almost all resource constraints, it is observed that EDP_D < EDP_S < EDP_M, which supports the claim made in Section 3 with a motivating DFG: using multiple supply voltages and dynamic frequency clocking, energy reduction is achieved without degrading performance. We also conducted experiments with three supply voltage levels and found that the percentage energy savings could only increase by 5%. Figure 12(a) shows the percentage savings (average S_E) averaged over all resource constraints. From the chart it is evident that the scheduling yields approximately equal savings for all kinds of benchmark circuits. The EDP reduction (average S_EDP) averaged over all resource constraints is shown in Figure 12(c). From the above, we may conclude that the scheduling algorithm yields appreciable energy savings and EDP reduction. In order to find the right combination of the types and number of resources that yields the best results in terms of energy reduction and high performance, we plotted energy consumption (%) versus the time ratio (T_D/T_S) and identified the configuration corresponding to maximum S_EDP.


Fig. 12. Average energy and EDP reduction for benchmarks.

Table XIII. Configurations for Minimum EDP Using RC-DFC

Benchmark    Multipliers (3.3 V / 5.0 V)    ALUs (3.3 V / 5.0 V)
ARF          3 / 0                          1 / 1
BPF          2 / 0                          0 / 1
EWF          2 / 0                          0 / 1
DCT          1 / 1                          0 / 1
FIR          2 / 0                          0 / 2
HAL          3 / 0                          1 / 1

Based on this analysis, the processor configurations that yielded the minimum energy-delay-product for each benchmark are listed in Table XIII.

The TC-DFC scheduler was tested for three different time constraints: 1.5, 1.75, and 2.0 times the critical path delay (Tcp). The voltage constraint was relaxed, unlike in RC-DFC. The results for various benchmark circuits are reported in Table XIV. Figure 12(b) shows the chart indicating the energy savings for the different benchmarks averaged over all time constraints. Our observation is that circuits which require an equal number of ALU-related operations


Table XIV. Energy Savings Using TC-DFC Scheduler

Benchmark   Time Constraint   E_S (pJ)   E_D (pJ)   S_E (%)
(1) ARF     1.5 Tcp           36186      21491      41
            1.75 Tcp          36186      18139      47
            2.0 Tcp           36186      15274      58
(2) BPF     1.5 Tcp           27672      15187      45
            1.75 Tcp          27672      9350       66
            2.0 Tcp           27672      8249       70
(3) EWF     1.5 Tcp           19422      12335      36
            1.75 Tcp          19422      8814       55
            2.0 Tcp           19422      5341       73
(4) DCT     1.5 Tcp           30675      14611      52
            1.75 Tcp          30675      14489      53
            2.0 Tcp           30675      7714       75
(5) FIR     1.5 Tcp           18696      4910       74
            1.75 Tcp          18696      4877       74
            2.0 Tcp           18696      4820       74
(6) HAL     1.5 Tcp           13614      7808       43
            1.75 Tcp          13614      6821       50
            2.0 Tcp           13614      4449       67

Table XV. Savings in Percent and Time Penalty in Cycles for Various Resource Constrained Schedulers

            RC-DFC          Shiue 2000      Sarrafzadeh 1999   Johnson 1997a
Benchmark   S_E     N_D     S_E     T       S_E     T          S_E     T
ARF         24–58   9–10    11–14   11–16   16–20   17–24      16–59   10–18
BPF         27–56   8–10    —       —       —       —          —       —
EWF         38–61   10–13   14–14   17–20   13–32   21–25      11–50   12–24
DCT         41–63   13–18   —       —       —       —          —       —
FIR         20–67   6–10    —       —       16–29   10–15      28–73   5–10
HAL         29–62   2–3     19–28   5–6     —       —          —       —

(addition, subtraction, or comparison) and multiplier operations saved more energy. The energy savings increased as the time constraint was relaxed from 1.5 Tcp to 2.0 Tcp. The energy savings of the proposed RC-DFC scheduling algorithm are listed in Table XV, along with those of other existing resource-constrained multiple-voltage scheduling algorithms. The minimum and maximum range of energy savings is also shown in the table. It is clear from the table that RC-DFC gives better energy savings with smaller time penalties. The energy savings of existing multiple-supply-voltage-based time-constrained scheduling algorithms are shown in Table XVI. In all cases, the time constraints range from 1.5 Tcp to 2.0 Tcp. It should be noted that Table XV and Table XVI give only a broad picture of the proposed work relative to existing work in the literature: the existing methods use different benchmarks and different resource or time constraints in their experiments and, moreover, while existing work has explored multiple supply voltages only, we used multiple supply voltages combined with dynamic frequency clocking, so a strictly fair comparison is not possible. However, to give a broad idea of our proposed work

Table XVI. Percentage Savings for Various Time Constrained Schedulers

Benchmark   TC-DFC   Chang 1997   Shiue 2000   Manzak 2002
ARF         41–58    40–63        38–76        25–61
BPF         45–70    —            —            —
EWF         36–73    44–69        13–76        10–55
FDCT        52–75    43–69        —            —
FIR         74–74    —            —            —
HAL         43–67    41–61        22–77        19–62

with respect to existing work, we have provided Table XV and Table XVI showing the range of energy reduction (not fixed values) for common benchmarks.

8. CONCLUSIONS

Our aim was to use frequency scaling concepts for energy-efficient, high-performance special-purpose processor (ASIC) design. The energy reduction was achieved through voltage reduction, and performance is maintained by using DFC along with multiple voltages. We developed resource-constrained and time-constrained datapath scheduling algorithms based on dynamic frequency clocking. The use of dynamic frequency clocking can generate enough slack to apply reduced voltages, which in turn saves energy. It is observed that, when using two supply voltage levels, an average energy reduction of 41%, and for three supply voltage levels an average reduction of 46%, was obtained for the benchmarks using the RC-DFC algorithm. Similarly, for TC-DFC, average energy reductions of 46% (for 1.5 Tcp) and 68% (for 2.0 Tcp) were obtained. The processor configurations for various benchmark circuits that result in a minimum energy-delay-product were determined through experiments. The integration of such a scheduler into a low-power datapath synthesis tool will significantly benefit low-power processor design, especially for data-intensive applications.

REFERENCES

ANTOLA, A., PIURI, V., AND SAMI, M. 1998. A low-redundancy approach to semi-concurrent error detection in datapaths. In Proceedings of Design Automation and Test in Europe. 266–272.
BENINI, L., MACII, E., PONCINO, M., AND MICHELI, G. D. 1998. Telescopic units: A new paradigm for performance optimization of VLSI design. IEEE Trans. Comput.-Aid. Des. Integrat. Circ. Syst. 17, 3 (Mar.), 220–232.
BENINI, L., MICHELI, G. D., LIOY, A., MACII, E., ODASSO, G., AND PONCINO, M. 1999. Automatic synthesis of large telescopic units based on near-minimum timed supersetting. IEEE Trans. Comput. 48, 8 (Aug.), 769–779.
BRYNJOLFSON, I. AND ZILIC, Z. 2000a. Dynamic clock management for low power applications in FPGAs. In Proceedings of the IEEE Custom Integrated Circuits Conference. 139–142.
BRYNJOLFSON, I. AND ZILIC, Z. 2000b. FPGA clock management for low power. In Proceedings of the International Symposium on FPGAs. 219–219.
BURD, T. AND BRODERSEN, R. W. 1995. Energy efficient CMOS microprocessor design. In Proceedings of the 28th Hawaii International Conference on System Sciences. 288–297.
BURD, T., PERING, T. A., STRATAKOS, A. J., AND BRODERSEN, R. W. 2000. A dynamic voltage scaled microprocessor system. IEEE J. Solid-State Circ. 35, 11 (Nov.), 1571–1580.
CHANG, J. M. AND PEDRAM, M. 1997. Energy minimization using multiple supply voltages. IEEE Trans. VLSI Syst. 5, 4 (Dec.), 436–443.


FETWEIS, G., CHIU, J., AND FRAENKEL, B. 1993. A low-complexity bit-serial DCT/IDCT architecture. In Proceedings of the IEEE International Conference on Communications. 217–221.
GRUNWALD, D., LEVIS, P., AND FARKAS, K. I. 2000. Policies for dynamic clock scheduling. In Proceedings of the 2000 Operating Systems Design and Implementation.
HSU, C. H., KREMER, U., AND HSIAO, M. 2000. Compiler-directed dynamic frequency and voltage scheduling. In Proceedings of the Workshop on Power-Aware Computer Systems. 65–81.
JOHNSON, M. AND ROY, K. 1997. Datapath scheduling with multiple supply voltages and level converters. ACM Trans. Des. Automat. Electron. Syst. 2, 3 (July), 227–248.
KIM, J. M. AND CHAE, S. I. 1996. New MPEG2 decoder architecture using frequency scaling. In Proceedings of the IEEE International Symposium on Circuits and Systems. 253–256.
KOLLIG, P. AND AL-HASHIMI, B. M. 1997. Simultaneous scheduling, allocation and binding in high level synthesis. IEE Electron. Lett. 33, 18 (Aug.), 1516–1518.
KUMAR, A. AND BAYOUMI, M. 1999. Multiple voltage-based scheduling methodology for low power in the high level synthesis. In Proceedings of the International Symposium on Circuits and Systems (Vol. 1). 371–379.
LIN, Y. R., HWANG, C. T., AND WU, A. C. H. 1997. Scheduling techniques for variable voltage low power design. ACM Trans. Des. Automat. Electron. Syst. 2, 2 (Apr.), 81–97.
MANZAK, A. AND CHAKRABARTI, C. 2002. A low power scheduling scheme with resources operating at multiple voltages. IEEE Trans. VLSI Syst. 10, 1 (Feb.), 6–14.
MARTIN, T. L. AND SIEWIOREK, D. P. 2001. Nonideal battery and main memory effects on CPU speed-setting for low power. IEEE Trans. VLSI Syst. 9, 1 (Feb.), 29–34.
MICHELI, G. D. 1994. Synthesis and Optimization of Digital Circuits. McGraw-Hill, New York, NY.
MOHANTY, S. P. AND RANGANATHAN, N. 2003. Energy efficient scheduling for datapath synthesis. In Proceedings of the International Conference on VLSI Design. 446–451.
MOHANTY, S. P., RANGANATHAN, N., AND CHAPPIDI, S. K. 2003. An ILP-based scheduling scheme for energy efficient high performance datapath synthesis. In Proceedings of the International Symposium on Circuits and Systems (Vol. 5). 313–316.
MOHANTY, S. P., RANGANATHAN, N., AND KRISHNA, V. 2002. Datapath scheduling using dynamic frequency clocking. In Proceedings of the IEEE Computer Society Annual Symposium on VLSI. 65–70.
MONTEIRO, J., DEVADAS, S., ASHAR, P., AND MAUSKAR, A. 1996. Scheduling techniques to enable power management. In Proceedings of the ACM/IEEE Design Automation Conference. 349–352.
MUSOLL, E. AND CORTADELLA, J. 1995. Scheduling and resource binding for low power. In Proceedings of the 8th International Symposium on System Synthesis. 104–109.
PAPACHRISTOU, C. A. AND KONUK, H. 1990. A linear program driven scheduling and allocation method. In Proceedings of the 27th ACM/IEEE Design Automation Conference. 77–83.
PEDRAM, M. 1996. Power minimization in IC design: Principles and applications. ACM Trans. Des. Automat. Electron. Syst. 1, 1 (Jan.), 3–56.
PERING, T., BURD, T., AND BRODERSEN, R. W. 2000. Voltage scheduling in the lpARM microprocessor system. In Proceedings of the International Symposium on Low Power Electronics and Design. 96–101.
POUWELSE, J., LANGENDOEN, K., AND SIPS, H. 2001a. Dynamic voltage scaling on a low-power microprocessor. In Proceedings of the 7th International Conference on Mobile Computing and Networking.
POUWELSE, J., LANGENDOEN, K., AND SIPS, H. 2001b. Energy priority scheduling for variable voltage processors. In Proceedings of the International Symposium on Low Power Electronics and Design. 28–33.
RANGANATHAN, N., VIJAYKRISHNAN, N., AND BHAVANISHANKAR, N. 1998. A linear array processor with dynamic frequency clocking for image processing applications. IEEE Trans. Circ. Syst. Video Techn. 8, 4 (Aug.), 435–445.
SARRAFZADEH, M. AND RAJE, S. 1999. Scheduling with multiple voltages under resource constraints. In Proceedings of the IEEE Symposium on Circuits and Systems (Vol. 1). 350–353.
SHIUE, W. T. AND CHAKRABARTI, C. 2000. Low-power scheduling with resources operating at multiple voltages. IEEE Trans. Circ. Syst.-II: Analog Digital Signal Process. 47, 6 (June), 536–543.

Received November 2003; revised May 2004, August 2004; accepted October 2004

Voltage Scheduling Under Unpredictabilities: A Risk Management Paradigm AZADEH DAVOODI and ANKUR SRIVASTAVA University of Maryland

This article addresses the problem of voltage scheduling in unpredictable situations. The voltage scheduling problem assigns voltages to operations such that power is minimized under a clock delay constraint. In the presence of unpredictabilities, meeting the clock latency constraint cannot be guaranteed. This article proposes a novel risk-management-based technique to solve this problem. Here, the risk management paradigm assigns a quantified value to the amount of risk the designer is willing to take on the clock cycle constraint. The algorithm then assigns voltages in order to meet the expected value of the clock cycle constraint while keeping the maximum delay within the specified "risk" and minimizing the power. The proposed algorithm is based on dynamic programming and is optimal for trees. Experimental results show that the traditional voltage scheduling approach is incapable of handling unpredictabilities. Our approach is capable of generating an effective tradeoff between power and "risk": the more the risk, the less the power. The results show that a small increase in design risk positively affects the power dissipation.

Categories and Subject Descriptors: B.5.2 [Register-Transfer-Level Implementation]: Design Aids—Automatic synthesis, optimization

General Terms: Algorithms, Reliability, Design

Additional Key Words and Phrases: Predictability, voltage scheduling, low power, design closure

1. INTRODUCTION

Estimation in design automation is marred by inaccuracies. Important design objectives like power are extremely hard to predict, especially at high levels of the design flow. On the other hand, optimization of design objectives at the system level has a tremendous impact on design quality. Critical optimizations therefore need to be performed in unpredictable scenarios. Risk management tries to control the maximum amount of error/unpredictability associated with any estimation. Under this paradigm, the user specifies a risk which signifies the amount of inaccuracy the designer can "risk." This risk is the "violation likelihood" of the constraint, accepted for potential gains in design quality.

Authors' addresses: Electrical and Computer Engineering Department, University of Maryland, College Park, MD 20742; email: {azade,ankurs}@eng.umd.edu.


This article demonstrates this paradigm through the voltage scheduling problem, which assigns voltages to individual operations in a data flow graph (DFG) for power minimization. Assigning multiple voltages to operations is a very strong technique for power optimization, owing to the quadratic dependence of power on voltage. This methodology was primarily investigated by Raje and Sarrafzadeh [1995] and generalized by Chang and Pedram [1995]. Both these approaches assume that accurate information about power and delay is available through estimation engines. Since the complete implementation information is not known, this assumption is far from valid. In this work, we extend voltage scheduling to a risk management paradigm. Instead of characterizing the delay and power at each voltage by exact values, we represent them by probability distributions. This is more realistic, since the estimation is inaccurate. The designer specifies a clock constraint C and a risk factor R which represents the maximum number of clock cycles the designer is willing to "risk." The algorithm then assigns voltages such that the expected value of the clock delay is ≤ C and the maximum latency is ≤ R while minimizing the power.

The advantage of a risk management paradigm is that it gives control of the unpredictabilities to the designer. Depending on the amount of "risk" the designer is willing to take, the design quality changes. If the acceptable "risk" is high, the algorithm will be relaxed and will generate a lower-power solution. Hence we can expect a "risk" versus power tradeoff. The designer should then be able to pick the appropriate point on this tradeoff curve which indicates a balance between design risk and design quality. In this endeavor, the expected value of the clock cycle constraint is never relaxed; hence this approach is different from slack-based techniques. We also demonstrate that a small increase in "risk" can significantly affect power. Designers can experiment with this new parameter to generate solutions that meet their power constraints with a reasonable degree of predictability. The main contribution of this article is proposing the risk management paradigm and demonstrating its effectiveness through the voltage scheduling problem.

The rest of this article is organized as follows. Section 2 contains a brief discussion on the issue of unpredictabilities. Section 3 reinvestigates the low-power voltage scheduling problem. Section 3.2 presents our risk management algorithm, and Section 4 contains the experimental results.

2. UNPREDICTABILITIES: AN INTEGRAL ASPECT OF DESIGN AUTOMATION

Design automation is a step-by-step procedure mainly comprised of optimization algorithms driven by estimation engines. Estimation of design quality, especially at the system level of the design flow, is an extremely complicated task. Accuracy of estimation is severely limited at the system level since various design parameters are not known. On the other hand, design decisions taken at the high level greatly impact the design quality and the time to market. Hence critical design decisions need to be made in the presence of high degrees of estimation inaccuracy.


There has been a tremendous amount of effort in the past aimed at improving prediction accuracy, especially with respect to power [Pedram 1996]. Typically, most approaches try to increase the amount of implementation information (like wire delay or leakage power) available at the earlier stages of the design flow, hoping to increase the accuracy of prediction. Estimation tools try to predict the course of future/low-level optimizations in order to increase the implementation detail. Unfortunately, even state-of-the-art estimation techniques cannot guarantee a reasonable degree of accuracy; unpredictabilities creep into any such strategy. This article addresses the issue of unpredictability by proposing a strategy that trades accuracy/risk against design quality, hence managing both the unpredictability and the design quality.

2.1 Unpredictability: Sources and Impact

Unpredictability is defined as the quantified form of accuracy [Srivastava and Sarrafzadeh 2002]. There are many sources of unpredictability for various cost functions such as power. At higher stages of the design flow, unawareness of the exact logic structure of a functional module makes exact estimation of power, area, delay, etc., impossible. Another important source of unpredictability in power estimation is unawareness of the exact switching activity. Furthermore, exact values of low-level details like wire delay or wire capacitance cannot be determined either, forcing estimation to have inaccuracies. Hence inaccuracies/unpredictabilities creep in due to an unawareness of exact implementation details. From a practical point of view, it is not possible to estimate the exact implementation details. Even if we had excellent estimation engines which capture each and every parameter responsible for a particular design objective, there would be no way to predict the exact values of those parameters. One strong reason for this is the interdependence between various cost functions: at later stages of the design flow, aggressive optimization of one design objective usually drastically affects other design objectives, hence invalidating any system-level decisions/optimizations based on early estimation.

2.2 Risk Management: Tradeoff Between Design Quality and Predictability

In any optimization problem, the design constraints must be satisfied to generate a valid/feasible design. More specifically in this work, let us suppose the constraint C is the clock cycle constraint of a DFG and O is the power optimization objective obtained through voltage scheduling. Even if the generated solution satisfies C, there is no guarantee that C will still be satisfied after the lower levels of design optimization (logic synthesis, physical design); this might force several iterations. For instance, if area/noise is heavily optimized at the placement stage, the clock cycle constraint may get violated. Now let us assume that we know the kind of unpredictability associated with each solution of the optimization problem. This means that, for each solution to the voltage scheduling problem, we know not only the expected value of the clock cycles C but also its range. Let us also suppose


the designer specifies a risk factor he/she is willing to take on C. This specifies the maximum violation of C the designer is willing to tolerate. For example, C might be 10 clocks, but the designer is willing to tolerate C up to 12. In the presence of such a scenario, an unpredictability/risk management paradigm would approach the problem in the following style:

(1) The expected value of the latency should always be less than C (previously C itself was the constraint).
(2) The associated unpredictability is such that it is always less than the designer-specified risk R.
(3) The objective value O is the best possible among this range of accepted (valid) solutions.

Hence the unpredictability management paradigm manages the associated error to keep it within acceptable bounds. Note that the expected value of the constraint must still be less than C; it is just that the worst-case likelihood of the constraint can be tuned. It should be noted that the work done on modeling and optimization under uncertain/unpredictable cost functions is in complete contrast to our approach. As examples, Jyu and Malik [1993, 1994], Tomiyama et al. [1998], and Tomiyama and Yasuura [1998] addressed the issue of predictability in their own individual ways. Tomiyama et al. [1998] and Tomiyama and Yasuura [1998] addressed the unpredictability in delay estimation due to manufacturing defects; they presented techniques for resource binding and module selection such that the likelihood of failure is minimized. Jyu and Malik [1993, 1994] also captured manufacturing defects and tried to maximize the probability of meeting the performance constraints in the presence of manufacturing variations; they proposed optimization methodologies and delay models to this effect. In contrast, the approach here is to make designs more predictable in the presence of low-level optimization uncertainties, such that high-level decisions can be taken with greater confidence. Here, we expect a tradeoff between risk and design quality: if the designer's specified risk is low, then the design quality (power in the voltage scheduling problem) will be worse; if the risk is infinity, then the problem is the same as the traditional approach.

3. LOW-POWER VOLTAGE SCHEDULING: TRADITIONAL VERSUS RISK MANAGEMENT APPROACHES

In this section we initially overview the multiple-supply voltage scheduling problem. Power is quadratically dependent on the supply voltage, as illustrated below [Chandrakasan et al. 1992]:

Power = K · V^2 · β,    (1)

where K is a proportionality constant, V the supply voltage, and β the switched capacitance factor. Numerous techniques have been proposed to optimize the various components of this expression. Chen et al. [2001] described a strategy that reduces


the power dissipation through voltage scaling but without any performance penalty. A substantial amount of power is dissipated in the clock tree, because its switching activity is the highest. Many articles have addressed the issue of clock power; Tellez et al. [1995] presented one such approach. At the behavioral level, a wide class of transformations can be applied in order to achieve lower switched capacitance. These include low-power binding and resource allocation [Chang and Pedram 1995; Kruse et al. 2001]. Dynamic power management is a system-level technique for controlling the power dissipation by shutting down idle components [Chung et al. 1999].

Power has a nonlinear relationship with the voltage. Reducing the voltage reduces the power dissipation but also increases the gate delay; hence there is a power/performance tradeoff. Voltage scaling, which reduces the supply voltage, quadratically affects the power but also increases the delay, which can be expressed as follows:

Delay = K · V / (V − Vt)^α,    (2)

where K is a proportionality constant, V the supply voltage, and Vt the threshold voltage.

3.1 Traditional Approach

System-level voltage scheduling techniques take a DFG as input and assign voltages to each operation such that the sum of the overall operation power is minimized and the given clock constraint is satisfied. This problem was tackled in Raje and Sarrafzadeh [1995] and solved optimally for general directed acyclic graphs (DAGs), assuming the same voltage-delay/power curves for all nodes. Chang and Pedram [1997] relaxed this assumption and solved the problem optimally for tree-like DFGs. The voltage scheduling problem has the following inputs and outputs:

(1) Input: the DFG, and voltage/delay and voltage/power curves for all operations. This can include the availability of different architectures for each operation type, which means each operation can have multiple voltage/delay and voltage/power curves. A delay constraint in clock cycles, C, is also known.
(2) Output: voltage (and architecture) assignments to each operation such that power dissipation is minimized while the clock constraint is satisfied.

Figure 1 illustrates a typical variation of voltage/delay and voltage/power. The formulation assumes a set of predecided voltages (V1, V2, and V3 in this case) which will be available on the chip. The algorithm for tree-like DFGs has two passes: the forward pass and the backward pass. In the forward pass, the DFG is traversed topologically from the primary inputs (PIs) to the primary outputs (POs). By the end of the forward pass, the co-optimal solutions that meet the clock delay constraint are determined, and the optimal solution that results in minimum power is chosen. This is followed by a backward pass in which the voltage assignment corresponding to the optimal solution is determined. Next, more details on these two passes are provided.
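To make Equations (1) and (2) concrete, the following C sketch evaluates both at a few supply voltages; the constants K, β, Vt, and α are illustrative guesses, not values from the article:

#include <math.h>
#include <stdio.h>

/* Eq. (1): Power = K * V^2 * beta;  Eq. (2): Delay = K * V / (V - Vt)^alpha.
   All constants below are assumptions chosen only for illustration. */
static double power(double V) { const double K = 1.0, beta = 0.5;  return K * V * V * beta; }
static double delay(double V) { const double K = 1.0, Vt = 0.7, alpha = 1.3;
                                return K * V / pow(V - Vt, alpha); }

int main(void)
{
    const double volts[] = { 5.0, 3.3, 2.4, 1.5 };  /* voltage levels used in these papers */
    for (int i = 0; i < 4; i++)
        printf("V = %.1f  power = %6.2f  delay = %6.2f\n",
               volts[i], power(volts[i]), delay(volts[i]));
    return 0;  /* power falls quadratically with V while delay grows as V approaches Vt */
}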


Fig. 1. Power/voltage, delay/voltage variation.

3.1.1 The Forward Pass. Initially we introduce the notation used to describe the delay/power curves in Figure 1. The solution at a node i, referred to as nodeInf_i, includes these curves: nodeInf_i contains, for each predecided voltage, its associated delay and power values. In the forward pass, the DFG is traversed from the PIs to the POs in topological order. At each node, the minimum power dissipation of the subgraph rooted at that node is stored as a delay curve corresponding to that node. This is represented as a set of ordered tuples (t, p), each of which is a solution with an arrival delay of t clock steps and minimum power p in the subgraph rooted at that node. Here we refer to this ordered set as arrivalInf for each node: arrivalInf_n contains information about the times at which the signal from node n becomes available. Given the arrivalInf of the fanins of any node, a function max is defined which considers the arrival times of all the fanins of the node simultaneously. The output of max is another delay curve, referred to as arrivalMax, for each node. Since the outputs of different fanins become available at different times, the signal coming from all the fanins is valid at the maximum of all these arrival times. This is done by merging the arrivalInf of each fanin. Considering any node i, the max function looks at the arrivalInf of all the fanins of i and creates one delay curve (arrivalMax) representing the availability of the signal from the fanins. Note that, since different delay combinations might have the same maximum, among all combinations with the same maximum, the one with minimum power is stored. This is represented in the equation below:

arrivalMax_i = max(arrivalInf_j, ∀ j ∈ fanin(i)).    (3)

The function max symbolically finds the maximum of the arrivalInf curves of the fanins of each node. For each delay combination, the power is simply the sum of the powers of the fanins, where the power of each fanin is specified by the assigned delay of that fanin. This is represented in the equation below:

p_i = Σ_{j ∈ fanin(i)} p_j.    (4)

When computing arrivalMax_i for each node i, delay values greater than the constraint C are discarded; hence the size of this function is always bounded. The arrivalMax obtained after applying max to all the fanins is then subjected to a


sum function. The inputs to sum are the arrivalMax and the nodeInf of the operation. Recall that each point on nodeInf reflects a voltage assignment that results in a known delay and power. The sum function adds these curves to generate the arrivalInf of the node. ArrivalInf signifies the arrival delay at the output of the corresponding node/operation and the total power dissipation of the subgraph rooted at that node. Each combination of arrivalMax and nodeInf is added to generate a point on the arrivalInf of the node; therefore arrivalInf is in fact the convolution of the two curves. This is computed for all nodes in forward topological order. At the end, among the solutions stored at the primary output(s), the one with minimum power is selected as the best solution. Note that the best solution is in a compact form reflecting the overall power of all nodes; a reverse topological traversal is necessary to specify the delay associated with each node.

3.1.2 The Backward Pass. In this stage, the DFG is traversed from the POs to the PIs in topological order. At each step, the exact voltage and architecture decisions for the operations are made. At each node, the indices of nodeInf specify the solution for that node, and the indices from arrivalMax specify the referring indices for the fanins of the node. The reverse topological traversal therefore ensures that, in the end, at the PIs, the assignment of all operations has been done and the solution is complete. This approach is optimal if the DFG is a tree. If the DFG is not a tree, then in the backward pass more than one choice exists for nodes with more than one fanout; heuristics can then be used to make a good choice. Chang and Pedram [1997] discussed the use of level converters to maintain the integrity of the signal. Now that the traditional approach has been explained, the risk management alternative is presented in the next section.

3.2 Risk Management for Voltage Scheduling

Referring back to Section 2, we propose a new voltage scheduling methodology to address unpredictabilities. The previous algorithm assumed accurate information about the delay/voltage and power/voltage variations. This is definitely not a realistic assumption. Figure 1 illustrates that, instead of having a fixed delay and power value for each voltage, we might have a distribution (the dotted distribution for each voltage). In this case, the traditional assumption that accurate estimates are available to the optimization algorithm is completely wrong. Let us suppose that somehow we know the distribution (range of values with associated probabilities) for both delay and power at each voltage. Let us assume that each distribution has a range of interest: the likelihood of the value falling outside this range is very low (typically within a distance of 3 × variance from the expected value). With this information, we redefine our objective as follows:

(1) Assume the delay/voltage and power/voltage curves for all architectures for each operation are provided such that, for each voltage choice, we also have the distribution of delay and power. A delay constraint of C clocks and a risk factor R (R ≥ C) are also provided.


(2) The objective is to minimize the expected value of the power such that the expected value of the delay is ≤ C clocks and the worst-case delay of the associated delay distribution is always less than the risk R.

Since we add the power dissipation of all nodes in the final solution, replacing the power estimate of each operation by its expected value is sufficient, because the expected value of a sum of n variables is the sum of their expected values. Handling the delay constraint and the delay risk requires a more sophisticated algorithm, which is described later.

3.2.1 Estimating Operation Unpredictabilities. An important question is: "How do we know the distribution of delay and power for each voltage?" This is a very tricky problem. The basic premise of this work is that estimating distributions is easier than estimating exact values. Distributions will strongly depend on the kind of optimizations the subdesign will be subjected to in the future, which in turn depends on how critical a certain objective function becomes in that subdesign. It also depends on the sensitivity between different cost functions. A complex estimation engine that has models for tools (like gate sizing and buffer insertion) and not just libraries, along with consideration of sensitivity, will be needed. Such an estimation engine would take a subdesign and, depending on the criticality of different constraints, generate a range of values for each design objective. To the best of the authors' knowledge, no such estimation system exists, although the development of such a system is underway. Any further discussion of unpredictability estimation is beyond the scope of this work; the rest of this article assumes that these ranges are available as input.

3.2.2 Algorithms for Risk Management. Once again, we are given delay distributions for all voltages. Let us suppose that the delay distributions are provided in terms of clock steps (a general delay distribution can easily be transformed into this form if the clock frequency is given). Hence we know, for each operation and each modular architecture for that operation, the expected value, maximum value, and distribution of the delay in clocks for each prespecified voltage. The problem is to assign voltages and architectures to operations such that the new objective enumerated above is satisfied.

Once again the algorithm makes two passes over the DFG: the forward pass traverses the DFG in forward topological order and the reverse pass in reverse order. Before the forward pass starts, some preprocessing is performed. For each node, we first generate a two-dimensional array which stores relevant information about that node and is referred to as nodeInf_n for a node n. NodeInf_n is essentially a way of representing the power and delay distributions for each voltage in a compact form. This is done by indexing the array in terms of the expected delay and the taken risk. In this array, the rows correspond to the expected delay, ranging from 1 to C. The columns signify the taken risk, ranging from 1 to R. The value stored at the (i, j) cell signifies the minimum (expected) power solution with an expected delay of i clocks and a maximum delay (or risk) of j clocks. This is stored in the power field of each cell (nodeInf_n[i][j].expPow). Note that since, in the final objective,


the expected value of power is minimized, we store the expected value of each power distribution in the power field. The associated probability distribution of the delay (which has an expected value of i clocks) is also stored, in nodeInf_n[i][j].prob. We also remember the voltage and architecture that result in this solution. Note that this data is computed independently, without any consideration of the inputs of these operations. Now we proceed with the forward topological traversal of the DFG. At each node, we compute another two-dimensional array, called arrivalInf_n, with the same rows and columns as nodeInf. ArrivalInf_n essentially represents the times (resulting from different combinations of delays of previous nodes) at which the signal from node n becomes valid and available. Therefore, from the definition, it can be concluded that arrivalInf_n should include all different delay combinations of the subgraph rooted at n, so as to reflect the arrival time of the signal of n. It should be clear that if node n is a primary input, then nodeInf_n and arrivalInf_n are the same. The (i, j) location of arrivalInf_n contains the solution with minimum (expected) power dissipation of the subgraph rooted at node n, an expected arrival delay of exactly i, and a maximum delay (risk) of exactly j clocks. The term arrival delay signifies the number of clock cycles it takes for the data of n to become available; conceptually it is similar to arrival time in gate-level circuits. ArrivalInf_n[i][j].expPow stores the power value and arrivalInf_n[i][j].prob stores the distribution of the arrival delay. If a node n is not a primary input, then the computation of arrivalInf_n is more involved. The arrival time of a node is defined as

arrivalInf_n[i][j].prob = max(arrivalInf_k[i′][j′].prob | k ∈ Fanin(n)) + nodeInf_n[i′′][j′′].prob.    (5)

Here arrivalInf_n[i][j].prob corresponds to one of the delay distributions stored at arrivalInf_n. Similarly, nodeInf_n[i′′][j′′] corresponds to a possible delay distribution for node n stored at nodeInf_n. All fanins of node n should be considered. Equation (5) states that, given a combination of arrival delay distributions for the fanins of a node n and a delay distribution for the node itself, the arrival distribution of the node is the maximum of the arrival distributions of its fanins summed with the delay of the node. Note that Equation (5) should be considered for all valid combinations of fanin and node delay distributions (all valid i, j, i′, j′, i′′, j′′). Recall the definitions of the rows and columns of the arrays: a row indexed by i indicates an expected arrival time of i clock steps, and a column indexed by j indicates an exact risk of j units for the corresponding stored solutions. Based on these definitions, the following conditions should hold for any valid combination:

—i′ ≤ i and i′′ ≤ i, to ensure that the expected arrival of a solution stored at row i is by clock step i;
—j′ + j′′ = j, to ensure that all solutions stored at column j have a risk equal to j.
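To fix ideas, one (i, j) cell of nodeInf/arrivalInf can be pictured as the C structure below. The field names expPow and prob come from the article; the remaining layout (bounds and the stored voltage/architecture indices) is our own assumption:

#define C_MAX 32  /* assumed bound on the clock constraint C */
#define R_MAX 64  /* assumed bound on the risk R             */

/* The minimum expected-power solution whose expected arrival delay is
   exactly i clocks (row) and whose worst-case delay (risk) is exactly
   j clocks (column). */
typedef struct {
    double expPow;           /* expected power of the stored solution       */
    double prob[R_MAX + 1];  /* delay distribution: prob[d] = Pr(delay = d) */
    int    voltage;          /* voltage choice that produced this solution  */
    int    arch;             /* architecture choice for this solution       */
} Cell;

Cell nodeInf[C_MAX + 1][R_MAX + 1];     /* per operation, built in preprocessing   */
Cell arrivalInf[C_MAX + 1][R_MAX + 1];  /* per node, filled in topological order   */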


Algorithm 1. MAX-PROB(P1, P2)
INPUTS: P1, P2: input delay distributions
OUTPUT: P: distribution which is Max(P1, P2)
comment: P1 and P2 are bounded by R, the maximum risk in delay

for (i = 1; i ≤ R; i++)
    P[i] = 0;
    for (j = 1; j ≤ i; j++)
        P[i] += P1[i] * P2[j]
    for (j = 1; j < i; j++)
        P[i] += P2[i] * P1[j]
return P
end

Algorithm 2. ARRIVAL-MAX(n1, n2)
INPUTS: operations n1, n2
OUTPUT: arrivalMax(Max(n1, n2))
comment: arrivalInf(n1) and arrivalInf(n2) have been computed

Allocate memory for Temp
for (i = 1; i ≤ C; i++)
    for (j = 1; j ≤ R; j++)
        for (k = 1; k ≤ C; k++)
            for (l = 1; l ≤ R; l++)
                Temp[i,j,k,l].prob = MAX-PROB(arrivalInf[n1][i][j].prob, arrivalInf[n2][k][l].prob)
                Temp[i,j,k,l].expPow = arrivalInf[n1][i][j].expPow + arrivalInf[n2][k][l].expPow
for (i = 1; i ≤ C; i++)
    for (j = 1; j ≤ R; j++)
        Find k,l,m,n such that EXPECT(Temp[k,l,m,n].prob) = i and MAX(Temp[k,l,m,n].prob) = j, with minimum power
        arrival-max[i][j].prob = Temp[k,l,m,n].prob
        arrival-max[i][j].expPow = Temp[k,l,m,n].expPow
return arrival-max
end

We first need to compute the max of all fanins probabilistically and add it to the delay of the node. Since we traverse the DFG topologically, the arrivalInf of all fanins is known. The computation of max is illustrated in Algorithm 1, which describes the standard max of two probability distributions. Algorithm 2 merges the arrivalInf of two fanins using the max function. Note that the arrivalInf of each fanin contains distributions reflecting the different scenarios in which the signal becomes available at the fanin output. Therefore, Algorithm 2 uses Algorithm 1 to combine two sets of distributions, corresponding to two fanins, into a new set of distributions reflecting the arrival time of the two nodes. For all possible combinations of arrival delay distributions at the outputs of the fanins, the algorithm calls MAX-PROB, which computes the max. Out of all these combinations, the ones with minimum (expected) power are chosen; that is, if two resulting distributions have the same expected delay and risk factor indices, the one with the minimum expected power is stored. This data is stored in a temporary arrivalMax array.
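Written out in C, Algorithm 1 is only a few lines; this is an illustrative rendering under the same conventions (distributions are arrays indexed by clock step 1..R), not the authors' implementation:

/* Distribution of max(X1, X2) for independent X1, X2 bounded by R:
   Pr(max = i) = Pr(X1 = i) * Pr(X2 <= i) + Pr(X2 = i) * Pr(X1 < i).
   The strict inequality in the second term keeps the tie (i, i)
   from being counted twice. */
void max_prob(const double P1[], const double P2[], double P[], int R)
{
    for (int i = 1; i <= R; i++) {
        P[i] = 0.0;
        for (int j = 1; j <= i; j++)
            P[i] += P1[i] * P2[j];
        for (int j = 1; j < i; j++)
            P[i] += P2[i] * P1[j];
    }
}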


Algorithm 3. SUM-PROB(P1, P2)
INPUTS: P1, P2: input distributions
OUTPUT: P: distribution which is SUM(P1, P2)
comment: P1 and P2 are bounded by R, the maximum risk
comment: P can be more than R

for (i = 1; i ≤ R; i++)
    for (j = 1; j ≤ R; j++)
        P[i + j] += P1[i] * P2[j]
return P
end

Algorithm 4. arrivalInf(n)
INPUTS: n
OUTPUT: arrivalInf for n

compute arrival-max for n using Algorithm 2
for (i = 1; i ≤ C; i++)
    for (j = 1; j ≤ R; j++)
        for (k = 1; k ≤ C; k++)
            for (l = 1; l ≤ R; l++)
                Temp[i,j,k,l].prob = SUM-PROB(arrival-max[n][i][j].prob, nodeInf[n][k][l].prob)
                Temp[i,j,k,l].expPow = arrival-max[n][i][j].expPow + nodeInf[n][k][l].expPow
for (i = 1; i ≤ C; i++)
    for (j = 1; j ≤ R; j++)
        Find k,l,m,n such that EXPECT(Temp[k,l,m,n].prob) = i and MAX(Temp[k,l,m,n].prob) = j, with minimum power
        arrivalInf[i][j].prob = Temp[k,l,m,n].prob
        arrivalInf[i][j].expPow = Temp[k,l,m,n].expPow
return arrivalInf
end

Finally, we have one arrivalMax data structure which contains the first term of Equation (5). This needs to be added to the delay of node n in order to compute arrivalInf_n, and this again must be done probabilistically. The procedure is illustrated in Algorithm 4; it is based on the convolution of two distributions, which corresponds to their addition. All possible delay solutions of an operation are stored in nodeInf. This is merged with the computed arrivalMax using a probabilistic addition function (Algorithm 3). Finally, for each expected delay i and risk j, the solution with minimum power is stored. After the forward traversal, the arrivalInf of all the primary outputs is investigated to select the solution with minimum power.

In regard to the backward pass, after reaching the POs of the DFG, we choose the solution with the minimum power dissipation. This corresponds to a certain expected clock value ≤ C and a certain risk ≤ R. Taking this solution, we traverse the DFG in topological order from the POs to the PIs. At each step the fanout of the pertinent node forces a certain expected delay and risk factor (an (i, j) location in arrivalInf_n). This corresponds to a certain architecture and voltage for the node n and certain expected and maximum delay values for its fanins. In this fashion, the voltage and architecture for all the operations can be determined.
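Algorithm 3, likewise, is an ordinary discrete convolution. A C sketch under the same conventions (note that the output index can reach 2R before solutions exceeding the risk bound are pruned):

/* Distribution of X1 + X2 for independent X1, X2 bounded by R.
   P must have room for indices up to 2R. */
void sum_prob(const double P1[], const double P2[], double P[], int R)
{
    for (int i = 0; i <= 2 * R; i++)
        P[i] = 0.0;
    for (int i = 1; i <= R; i++)
        for (int j = 1; j <= R; j++)
            P[i + j] += P1[i] * P2[j];  /* Pr(X1 = i) * Pr(X2 = j) */
}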


If the DFG is not a "tree," then there will be nodes with more than one fanout. In the backward traversal, the solution of such a node can potentially be decided by any of its fanouts, which might result in a suboptimal solution. This corresponds to several choices of (i, j) locations in arrivalInf_n imposed by the different fanouts. We resolve this deadlock by accepting solutions in the priority order E, W. The first priority, E, selects the solution with minimum row index i; this corresponds to the solution with the minimum expected value of the arrival time at n. If E is the same for two solutions, we choose the option that has the smallest worst-case arrival delay W (or column index j). In this way, we try to ensure that the expected-value constraint on the delay is satisfied with a higher priority than the user-specified risk constraint.

The algorithm for voltage scheduling with risk management generates an optimal solution for tree-like DFGs. In the forward pass, the minimum expected power is guaranteed by the dynamic programming approach. If the DFG is tree-like, then in the backward pass this minimum solution can be exactly recovered, since the solution of each node is determined by its single fanout.

3.2.3 Postscheduling Resource Binding. We believe that resource binding can be conducted in a similar fashion as discussed in Chang and Pedram [1997]. Another aspect ignored in the described algorithm is the use of level converters to enable communication between different voltage levels. The algorithm can be trivially extended to consider level converters; this can be incorporated into Algorithm 4 by adding an extra term for level converter power when adding the power of two subsolutions.

4. RESULTS

The primary objective of this article is to propose the idea of risk management. The basic philosophy is to have a user-controlled parameter, which we call risk, that controls the amount of possible risk in meeting the constraints at the penalty of design quality. There are a few things that we want to demonstrate through our experiments. First, in the presence of unpredictabilities, the traditional design methodology (voltage scaling in this case) results in invalid solutions. Second, we want to demonstrate how the penalty in design quality (power in this case) varies as the risk factor is varied.

In our experiments, to compute the power and delay distributions, we assumed the existence of four distinct voltages (5, 3.3, 2.4, and 1.5 V). The estimated power and delay at these voltages were calculated using Equations (1) and (2). Each delay/power estimate should then be assigned an associated unpredictability/distribution. We assume that, at each specific voltage, the distribution of delay and power follows a Gaussian distribution with the value predicted by Equations (1) and (2) as the mean. The variance was set to 20% of the mean. In reality, these values should be generated by an unpredictability estimation engine; since no such system exists, we had to make this assumption. For large data sets, the pertinent statistics are usually Gaussian, so the Gaussian assumption should be accurate. But this

Table I. Comparison of Traditional Versus Risk Management Approaches

                        Traditional                     Risk Management
Bench      T-Clk (ns)   Const.   Risk   Power           Expect. const.   Risk   Power
dct        8            25       50     212.63          25               25     301.35
ecbenc-4   8            25       50     128.65          25               25     225.92
ellipt     12           40       80     356.29          40               40     523.50
fft2       8            15       30     319.11          15               15     543.00
fir        8            35       70     157.93          35               35     197.00
jdmer-4    8            20       40     253.26          20               20     429.00
jdmer-3    8            8        20     399.86          8                8      472.78
jdmer-1    8            8        20     381.40          8                8      441.50
motion-2   8            14       30     404.00          14               14     655.38
motion-3   8            14       30     404.00          14               14     655.00
noise-2    8            8        20     773.21          8                8      837.44

needs to be experimentally validated. Proceeding further, we calculated the expected value of each delay/power and gave that as the estimate to the traditional algorithm proposed by Chang and Pedram [1997]. Then we used our proposed algorithm to generate a solution in the risk management paradigm.

Table I illustrates the comparison between the traditional and risk-driven approaches. The benchmarks are a mix of MediaBench [Lee et al. 1997] and the traditional high-level synthesis benchmark set. For the traditional approach, the provided delay constraint is reported in column 3, and the power of the generated solution in column 5. Once the traditional solution is generated, the risk associated with it is evaluated and reported in column 4. As can be seen, this risk is very high compared to the input delay constraint. For the risk management approach, columns 6 and 7 report the expected value of the delay constraint and the maximum allowed risk provided as input. The expected value of the delay constraint is the same as the delay constraint in the traditional solution; the maximum allowed risk, however, is given as an input equal to the delay constraint, which means the designer is not willing to take any risk on the timing constraint. It can be seen that the risk-driven approach always results in a valid solution, whose power is reported in column 8. If the designer is not willing to take any risk (as in the risk management case), the result of the traditional approach must be considered invalid. This illustrates the superiority of our approach over the traditional one.

The next set of experiments illustrates the variation of design quality (power) as the risk is changed. Hence, without changing the expected value of the timing, we illustrate the variation of power as the risk changes. Figure 2 reports this data for fft2 (the rest are omitted due to space considerations) for various expected delay constraints. It can be seen that, as the risk factor is increased (which means that the designer can tolerate more risk on the expected delay constraint), the power dissipation reduces. Hence there is clearly a risk/design-quality tradeoff. It can also be seen that a small increase in the risk factor tremendously affects the design quality, especially in Figure 2.
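The delay distributions used in this setup can be mimicked as follows. This C sketch discretizes a Gaussian (mean from Equation (2), variance 20% of the mean, as stated above) into clock-step probabilities; the exact discretization procedure is our own guess:

#include <math.h>

/* Fill P[1..R] with the probability mass of a Gaussian delay estimate
   falling in each clock step, then renormalize after truncation to R. */
void discretize_delay(double mean_ns, double clk_ns, int R, double P[])
{
    double sigma = sqrt(0.2 * mean_ns);  /* variance = 0.2 * mean, per Section 4 */
    double total = 0.0;
    for (int k = 1; k <= R; k++) {
        double d = k * clk_ns - mean_ns;              /* distance from the mean */
        P[k] = exp(-(d * d) / (2.0 * sigma * sigma)); /* unnormalized density   */
        total += P[k];
    }
    for (int k = 1; k <= R; k++)
        P[k] /= total;
}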


Fig. 2. FFT2: X-axis, risk; Y-axis, power.

5. CONCLUSION AND FUTURE WORK

This research proposes a formal methodology for addressing unpredictabilities at the system level. The idea is to associate a risk factor with each constraint; by controlling the risk factor, an appropriate solution that is guaranteed to be within the prescribed limits can be generated. This was demonstrated using the voltage scheduling problem: the delay constraint was assigned a risk factor, and an optimal algorithm was proposed which generates the minimum-power solution within the clock and risk constraints. This illustrates a formal way of handling unpredictabilities. Our future work will include the development of an unpredictability estimation engine. The development of a formal synthesis system that addresses unpredictabilities is also underway.

REFERENCES

CHANDRAKASAN, A. P., SHENG, S., AND BRODERSEN, R. W. 1992. Low power CMOS digital design. IEEE J. Solid State Circ. 27, 4 (April), 472–484.
CHANG, J. M. AND PEDRAM, M. 1995. Low power register allocation and binding. In Proceedings of the Design Automation Conference. 29–35.
CHANG, J. M. AND PEDRAM, M. 1997. Energy minimization using multiple supply voltages. IEEE Trans. VLSI Syst. 5, 2.
CHEN, C., SRIVASTAVA, A., AND SARRAFZADEH, M. 2001. On gate level power optimization using dual supply voltages. IEEE Trans. VLSI Syst. 9, 5 (Oct.), 616–629.





CHUNG, E. Y., BENINI, L., AND DE MICHELI, G. 1999. Dynamic power management using adaptive learning tree. In Proceedings of the International Conference on Computer Aided Design.
JYU, H.-F. AND MALIK, S. 1993. Statistical timing optimization of combinational logic circuits. In Proceedings of the International Conference on Computer Design. 77–80.
JYU, H.-F. AND MALIK, S. 1994. Statistical delay modeling in logic design and synthesis. In Proceedings of the Design Automation Conference. 126–130.
KRUSE, L., SCHMIDT, E., JOCHENS, G., STAMMERMANN, A., SCHULZ, A., MACII, E., AND NEBEL, W. 2001. Estimation of lower and upper bounds on the power consumption from scheduled data flow graphs. IEEE Trans. VLSI Syst. 9, 1 (Feb.), 3–14.
LEE, C., POTKONJAK, M., AND MANGIONE-SMITH, W. H. 1997. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. In Proceedings of the International Symposium on Microarchitecture.
PEDRAM, M. 1996. Power minimization in IC design: Principles and applications. ACM Trans. Des. Automat. Electron. Syst. 1, 1 (Jan.), 3–56.
RAJE, S. AND SARRAFZADEH, M. 1995. Variable voltage scheduling. In Proceedings of the International Workshop on Low Power Design.
SRIVASTAVA, A. AND SARRAFZADEH, M. 2002. Predictability: Definition, analysis and optimization. In Proceedings of the International Conference on Computer Aided Design.
TELLEZ, G. E., FARRAHI, A., AND SARRAFZADEH, M. 1995. Activity-driven clock design for low power circuits. In Proceedings of the IEEE International Conference on Computer Aided Design. 62–65.
TOMIYAMA, H., INOUE, A., AND YASUURA, H. 1998. Statistical performance driven module binding in high level synthesis. In Proceedings of the International Symposium on System Synthesis. 66–71.
TOMIYAMA, H. AND YASUURA, H. 1998. Module selection using manufacturing information. IEICE Trans. Fund. E81-A, 12 (Dec.), 2576–2584.

Received July 2003; revised March 2004; accepted January 2005


Energy-Aware Variable Partitioning and Instruction Scheduling for Multibank Memory Architectures

ZHONG WANG and XIAOBO SHARON HU
University of Notre Dame

Many high-end DSP processors employ both multiple memory banks and heterogeneous register files to improve performance and reduce power consumption. The complexity of such architectures presents a great challenge to compiler design. In this article, we present an approach to variable partitioning and instruction scheduling that maximally exploits the benefits provided by such architectures. Our approach is built on a novel graph model that strives to capture both performance and power demands. We propose an algorithm to iteratively find the variable partition that achieves the maximum energy saving while satisfying the given performance constraint. Experimental results demonstrate the effectiveness of our approach.

Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors—Compilers; C.3 [Special-Purpose and Application-Based Systems]—Signal processing systems; B.5.1 [Register-Transfer-Level Implementation]: Design—Memory design

General Terms: Algorithms, Performance

Additional Key Words and Phrases: Multiple memory banks, nonorthogonal architecture, instruction scheduling, operating mode, parallelism and serialism balance, runtime and energy saving tradeoff

1. INTRODUCTION

Authors' addresses: Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556; email: {zwang1,shu}@cse.nd.edu.

To meet the ever-increasing demands for higher performance and lower power in embedded systems, domain-specific processors with sophisticated architectures are being designed and deployed to better match target applications. One such architecture, often referred to as a nonorthogonal architecture [Cho et al. 2002], is characterized by irregular data paths comprising a heterogeneous register set and multiple memory banks. A number of embedded DSP processors, for example, the Analog Devices ADSP2100, Motorola DSP56000, and NEC uPd77016, are based on this architecture. Compared to a large, centralized homogeneous register file, a heterogeneous (in terms of instruction usage) register





set organized in a distributed fashion can reduce both access time and power, as well as simplify the control logic and chip layout design [Desoli 1998]. The use of multibank memory can potentially improve the exploitation of instruction-level parallelism, which in turn may decrease memory access time and energy consumption compared to a single large memory [Benini et al. 2000]. Harvesting the benefits provided by the nonorthogonal architecture hinges on effective compiler support. Parallel operations afforded by multibank memory give rise to the problem of how to maximally utilize the instruction-level parallelism. Similarly, heterogeneous register sets increase the difficulty of deciding which register set to use for a certain instruction. A good compiler should consider heterogeneous register set assignment and instruction scheduling together, since the two are closely related [Zeithofer and Wess 2001]. It is not difficult to see that compilation techniques for general-purpose architectures are not adequate to handle the irregularity in this architecture. In this article, we focus on two critical steps in the compilation process: partitioning variables (or data) among the memory banks, and scheduling memory access operations. The decisions made in these two steps can have a significant impact on the overall program code size, execution time, and energy consumption. A number of articles (e.g., Sudarsanam and Malik [2000]; Saghir et al. [1996]; Lorenz et al. [2001]; Cho et al. [2002]; Leupers and Kotte [2001]; Zhuge et al. [2001]; Wuytack et al. [1999]) have investigated the use of multibank memory to achieve maximum instruction-level parallelism (i.e., to optimize performance). These approaches differ in either their models or their heuristics (discussed in more detail in later sections). However, none of these works considered the combined effect of performance and power requirements. It is well known that memory components in embedded systems, particularly those for data-intensive applications, are a major power consumer [Catthoor et al. 1998]. To help ease the energy demands of memory, advanced memory modules are designed to operate in different modes, for example, active, idle, and sleep [Rambus 1999; Micron 1999], which have different operating currents. The exploitation of different operating modes together with multiple memory banks further complicates the problem of variable partitioning and memory operation scheduling. On top of this, the performance requirement often conflicts with energy savings. Previous works have studied the effects of multiple memory operating modes at higher levels such as program basic blocks, system tasks, or processes. However, significant energy savings and performance improvements can be obtained by exploiting memory operating modes and multibank memory simultaneously at the instruction level (which we illustrate with an example in Section 3, as well as in the experimental results section). In this article, we present our approach to variable partitioning and memory access operation scheduling in the presence of multibank memory and multiple memory operating modes, for maximizing energy savings without sacrificing performance. We present several observations that help categorize the different cases. A novel memory access graph model, which simultaneously captures potential energy savings as well as potential performance improvements, is proposed to overcome the weaknesses of previous techniques.
Based on this model, we have devised an iterative technique to find the best energy savings while satisfying the performance constraint.





Fig. 1. The architecture of DSP56000.

Experimental results show that our technique can achieve an average code size improvement of 14.15% over the unoptimized programs (for benchmarks in DSPstone [Zivoljnovic et al. 1994]) and 7.71% over the programs optimized by the original SPAM compiler (Princeton SPAM Compiler Project; see www.ee.princeton.edu/spam/). Code size improvements translate directly into shorter execution times. Such improvements are quite significant compared with those obtained by existing approaches. In terms of energy savings from memory modules, our results on average outperform those from SPAM by around 47%. The experiments on the benchmark programs also show that our algorithm runs much more efficiently than the original algorithm of SPAM.

The rest of the article is organized as follows. Section 2 presents the background material and reviews previous work. Sections 3 and 4 describe the energy savings strategy and the graph model, respectively. The variable partitioning and instruction scheduling algorithm is discussed in Section 5. Section 6 provides experimental results and, finally, Section 7 concludes the article.

2. PROBLEM FORMULATION AND RELATED WORK

In this section, we briefly discuss essential features of the nonorthogonal architecture. We then formulate our problem and review related work.

2.1 Target Architecture and Problem Formulation

Our target architecture consists of multiple memory banks and a heterogeneous register set. Associated with each memory bank is an independent set of address bus, data bus, and address generation unit (AGU). Figure 1 shows an example of such an architecture, that of the Motorola DSP56000. The DSP56000 has three sets of





register files ({X0, X1}, {Y0, Y1}, {A, B}) and two memory banks (X, Y). We used this architecture in our experiments; however, our algorithm can easily be extended to architectures with a homogeneous register set or more memory banks. We consider the memory modules used in the memory banks to have two operating modes, the active mode and the low-current mode (standby or sleep) [Micron 1999]. The operating mode transition is controlled by the memory controller, whose states can be modified through a set of configuration registers [Delaluz et al. 2001]; a detailed discussion of controlling memory operating mode transitions is beyond the scope of this article. In the active mode, a memory module performs normal reads/writes, while in the low-current mode the memory module does not perform any memory operation and consumes much lower current than in the active mode. A memory module can switch between the two operating modes by incurring a certain time overhead. The memory module supply current during the mode transition is the same as in the active mode. For instance, for a Rambus RDRAM module [Rambus 1999], it takes a negligible amount of time to switch from the active mode into the standby (low-current) mode, with the dynamic energy consumed in a cycle1 being reduced from 3.57 nJ to only 0.83 nJ, but it takes two clock cycles to switch back to the active mode. For a Micron SyncBurst SRAM module [Micron 1999], it takes two cycles to switch the module from the active mode into the snooze (low-current) mode, with the dynamic energy consumed in a cycle2 changing from 5.61 nJ to only 0.17 nJ, and another two cycles to switch back to the active mode. Clearly, in order to save energy by putting a memory module in the low-current mode, the consecutive idle time should be long enough to compensate for the transition time overhead. Furthermore, it is more beneficial to lump the idle times into a single long idle period than to disperse them. This presents some unique challenges to the problem we want to solve, which is formally defined as follows.

Definition 2.1. Given a program (in the form of intermediate code) and a nonorthogonal architecture specification, generate an instruction schedule that maximizes the memory operation parallelism and energy saving.

It is not difficult to envision that increasing parallelism could have an adverse effect on energy savings. Our goal is to devise a methodology to trade off performance, that is, operation parallelism, and energy savings in the Pareto-optimal solution set.

2.2 Related Work

To the best of our knowledge, no existing work has investigated the problem defined in Definition 2.1. However, a number of researchers have studied different aspects of this problem, for example, maximizing memory operation parallelism and exploiting the memory module operating modes. We briefly review them below to help clarify our unique contributions.

1 The dynamic energy in a cycle is obtained from the measured supply current values associated with memory modules documented in the data sheets (for a 3.3-V, 2.5-ns cycle time, 8-MB module).
2 The dynamic energy in a cycle is calculated for a 3.3-V, 5.0-ns cycle time, 1-MB module.





2.2.1 Related Work on Operation Parallelism. Previous related work can be roughly divided into two main categories: those that use compacted intermediate code as the starting point (e.g., Sudarsanam and Malik [2000]; Saghir et al. [1996]; Lorenz et al. [2001]; Cho et al. [2002]), and those that start with uncompacted intermediate code (e.g., Leupers and Kotte [2001]; Zhuge et al. [2001]). Compacted intermediate code refers to intermediate code that has been compacted or scheduled by some heuristic, such as list scheduling, to increase the instruction-level parallelism. Since scheduling is done prior to exploring memory bank assignments, some memory-operation-pair combinations may be left out of consideration no matter which heuristic is used to compact the code. Thus, the approaches in the first category often fail to exploit many optimization opportunities. Techniques in the second category overcome this problem by using the uncompacted code to explore all possible pairs of memory operations as long as there are no dependencies between them. Therefore, we adopt the same philosophy as these techniques, that is, starting with the uncompacted code.

Most existing techniques to explore parallelism adopt some graph model for variable partitioning. A major distinction between those techniques lies in the graph model definition. Reviewing these graph models can help explain why they are not adequate. Given a program represented by a control data flow graph (CDFG), an undirected graph can be constructed to model the relationship among the variables in the program. The nodes in the graph represent all the local variables stored in memory. Partitioning the nodes in the graph into different groups then leads to partitioning the corresponding variables to different memory banks. The effectiveness of such an approach relies on modeling edge weights properly to capture all relevant information. A straightforward way of assigning edge weights is to connect two nodes with an edge of weight 1 if the two corresponding variables do not have data dependencies and the memory operations involving the variables can potentially overlap [Leupers and Kotte 2001] (as accessing two such variables in parallel may decrease the schedule length). However, such potential parallelism may not always be realizable due to certain timing constraints on the associated memory operations. (Recall that the operations are to be scheduled later for uncompacted code.) Zhuge et al. [2001] introduced the concept of a possibility weight to capture the likelihood of parallelizing pairs of instructions. This model does improve on the simple graph model above, but it still has some deficiencies. For instance, to derive the edge weight between a pair of variables, they simply summed the possibility weights of all pairs of memory operations involving this variable pair in the entire procedure. Simply adding the possibility weights from different pairs of memory operations cannot correctly capture the difference in scheduling freedom between the operation pairs. We will discuss these deficiencies in more detail in Section 4. None of the existing graph models consider how to exploit serialism in instruction execution to trade off performance for energy savings. In this article, we propose to use two lists to describe the edge weight in the graph model. By





introducing one more dimension to the graph edge weight, we not only capture the serialism information among operations, but also overcome the deficiencies of previous models.

2.2.2 Related Work on Energy Savings. A number of research results have been published on saving energy through exploiting operating mode changes. The key idea is to distribute idle times judiciously through good scheduling. This can be achieved at various abstraction levels or design stages. For example, an online low-power task scheduling algorithm for multiple devices was presented in Lu et al. [2000]. An operating system (OS)-based solution was proposed in Delaluz et al. [2002], where the OS scheduler manages power mode transitions by keeping track of module accesses for each process in the system. Several articles have been published on exploiting the benefit of memory operating mode control. In Delaluz et al. [2001], a compiler-directed scheme was presented to reschedule basic blocks such that longer consecutive memory idle times can be obtained. The techniques in Benini et al. [2000] and Luz et al. [2002] considered data organization in multibank memory such that data accesses can be concentrated in a small number of banks while other banks can be left in the low-current mode. No instruction scheduling was considered in these works. Our work focuses on the instruction level. By integrating energy considerations into the instruction scheduling stage, we can achieve additional energy savings without sacrificing performance. Note that our work complements the above-mentioned energy savings techniques, since it can be applied together with them.

2.2.3 Other Related Work. Some research related to multiple memory banks concerns memory partitioning [Benini et al. 2000; Wuytack et al. 1999]. Given the memory access pattern of a class of programs, memory partitioning finds the best memory bank configuration, for example, the number of memory banks, the size of each bank, and the number of ports for each bank, from the viewpoint of instruction-level parallelism or energy savings. This is a different problem from the one considered in this article, in that the memory configuration is given in our architecture model, and we concentrate our effort on variable partitioning and instruction scheduling among memory banks. It is worth noting that Wuytack et al. [1999] deployed a model of conflict graphs and conflict probability, which derives the graph information from uncompacted intermediate code. Though it addressed a different problem, it also showed that working on uncompacted code reveals more optimization opportunities.

3. IDLE TIME EXPLORATION

As mentioned earlier, memory operating mode transitions do not come for free: extra clock cycles are needed to change between the active and low-current modes. Therefore, to exploit the low-current mode, longer consecutive idle times are more desirable for a memory bank. However, variable partitioning and instruction scheduling with only performance considerations may not lead to the best schedule in terms of idle time distribution. For example, for the data flow





Fig. 2. (a) An example DFG, where L (respectively, S) followed by an integer i represents a LOAD (respectively, STORE) operation on variable i. Other nodes are nonmemory operations. Edges denote the precedence constraints between operations. (b) Schedule with only performance consideration. (c) Schedule when the mode transition time is two cycles. (d) Schedule when the mode transition time is four cycles.

graph (DFG) shown in Figure 2(a), a schedule with only performance consideration is shown in Figure 2(b), while better schedules with respect to both performance and energy are shown in Figures 2(c) and 2(d). In Figure 2(b), the memory modules cannot be switched to the low-current mode because all idle times are too short. In the latter two schedules, the idle slots are lumped together such that one or more memory modules can change to the low-current mode during the idle periods. In Figure 2(c), both memory banks can be put into low-current mode during control steps 3 → 5 under the assumption that Rambus RDRAM is used, while in Figure 2(d), the second memory bank can be put into low-current mode during control steps 4 → 6 under the assumption that a Micron SRAM module [Micron 1999] is adopted. Thus, we gain energy savings without affecting the schedule performance. Memory operation scheduling for energy savings is tightly related to scheduling for maximum parallelism, but their different goals can lead to totally different schedules. For example, one could easily sacrifice all the parallelism by putting all variables in one memory bank, which gives the longest idle times for the other memory banks. Therefore, a tradeoff exists between energy savings and performance. We consider the problem of maximizing the energy savings without degrading the performance (i.e., program execution time).
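To make the break-even reasoning behind these schedules concrete, the following sketch (our illustration, not code from the article) computes the energy saved by sleeping a bank during an idle window, using the per-cycle energy figures quoted in Section 2.1; the helper name is hypothetical, and we assume, as described above, that all transition cycles draw active-mode current.

```python
# A minimal sketch (ours, not the article's code) of the break-even reasoning:
# sleeping a bank during a k-cycle idle window pays off only if the window
# exceeds the total mode-transition overhead m, since the module draws
# active-mode current while switching.

def sleep_saving_nj(idle_cycles, e_active, e_low, m):
    """Energy saved (nJ) by sleeping during an idle window; 0 if not worthwhile."""
    if idle_cycles <= m:
        return 0.0  # the window is too short to amortize the transition
    stay_active = idle_cycles * e_active
    sleep = m * e_active + (idle_cycles - m) * e_low
    return stay_active - sleep

# Per-cycle figures quoted in Section 2.1 for the Micron SyncBurst SRAM:
# 5.61 nJ active, 0.17 nJ snooze; we assume m = 4 (two cycles each way).
for k in (3, 5, 10):
    print(k, sleep_saving_nj(k, e_active=5.61, e_low=0.17, m=4))
```

Because the saving grows with the amount by which a window exceeds the transition overhead, one long idle period is worth more than the same number of idle cycles scattered across short windows, which is exactly why the schedules in Figures 2(c) and 2(d) lump the idle slots together.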





In the following, we examine an ideal scenario in which no register constraint exists; in other words, all the variables can be loaded at the earliest time and stored at the latest time. The importance of this case will become clear in Section 5, where we will show that an operation schedule can be regarded as the ideal scenario once the mobility is calculated with the register constraint in mind. Given a control data flow graph (CDFG) representing the behavior of a program segment, assume that the desired schedule length is t, the number of memory operations in the ith memory bank is ni, and the overhead for a memory module mode transition is m clock cycles. For a given t, there exist three cases depending on the relationship of t, ni, and m. Case 1.

min(t − ni) > m, ∀i.

Maximal energy savings can be readily achieved by Lemma 3.1. The correctness of Lemma 3.1 is easy to prove and is omitted.

LEMMA 3.1. If min(t − ni) > m, ∀i, then by simply pushing the LOAD (respectively, STORE) operations to the beginning (respectively, end) of the schedule, the maximal energy saving is achieved.

The schedule in Figure 2(b) belongs to this case when the operating mode transition time is two cycles. The schedule with optimal energy savings can be readily obtained by Lemma 3.1 and is shown in Figure 2(c). Case 2.

min(t − ni) ≤ m, ∃i, and t ≥ (n + m)/N.

In the above conditions, N denotes the number of memory banks, and n is the total number of memory operations, that is, n = n1 + · · · + nN. These two conditions mean that consecutive idle times long enough to change a memory module to the low-current mode can be formed in some, but not all, of the memory banks. To improve energy savings, one might consider moving memory operations between banks so as to serialize more operations in one or more banks while leaving the other banks with longer idle times. The goal is then to maximize “serialism” without degrading the performance. The example in Figure 2(d) illustrates this idea for the SRAM memory module. The desire to increase serialism in this case complicates the variable partitioning problem. Case 3.

t < (n + m)/N.
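As a compact restatement of this three-case analysis, the sketch below (ours, not from the article) classifies a target schedule length t against the per-bank operation counts; the Case 3 condition is taken as the complement of Case 2's bound on t.

```python
# A small sketch (ours, not from the article) of the three-case analysis, for
# schedule length t, per-bank memory-operation counts n (a list of the n_i),
# and mode-transition overhead m.

def classify_case(t, n, m):
    N = len(n)
    if all(t - ni > m for ni in n):
        return 1  # every bank has a long-enough idle window: Lemma 3.1 applies
    if t >= (sum(n) + m) / N:
        return 2  # long-enough idle windows exist in some, but not all, banks
    return 3      # t < (n + m)/N: idle windows are too scarce

# Lemma 3.1 in this setting: schedule each bank's LOADs in its earliest slots
# and its STOREs in its latest, leaving one idle window of t - n_i cycles.
print(classify_case(t=10, n=[3, 4], m=2))  # -> 1
```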
8. if Tschedule − Tmax > η then
       Lmin = min(Lmin, Lschedule), Tmax = max(Tschedule, Tmax), λs = (λp + λs)/2, Nstable = 0
   end if
9. if Lschedule − Lmin > φ then
       λs = (λs + λ′s)/2, λ′s = λs, Nstable = 0
   end if
10. if Nstable ≥ σ then break; end if
11. Recalculate the average weight of the MAG.
ENDWHILE
12. Output the corresponding variable partition and schedule.

Also, how does one handle the register constraints? We shall discuss these issues in the next section, where we present our complete algorithm.

5. ALGORITHM

Our variable partitioning and instruction scheduling algorithm is intended to be used in the back end of a compiler to optimize the intermediate code. The algorithm operates on the CDFG representation of a given piece of intermediate code. As the first step, it calculates the mobility for each operation with the register constraint in mind. Then the MAG is constructed based on the mobility information. With this MAG, the steps of average weight calculation, maximum cutting, variable partitioning, and instruction scheduling are iterated a number of times to find the best values of λp and λs. The algorithm framework is shown in Algorithm 1.

In Algorithm 1, Tschedule represents the number of consecutive idle cycles for the current schedule, while Tmax represents the maximal value of all Tschedule; λ′s is used to remember the value of λs in the previous loop iteration.






Lschedule and Lmin represent the current and minimum schedule lengths, respectively. φ is a user-specified parameter indicating the latency constraint; it is defined as the allowed difference between the final and minimum schedule lengths. η is a user-defined threshold measuring whether Tschedule has changed significantly. The algorithm finishes after Tschedule has shown no significant change for σ consecutive iterations.

In Line 1 of Algorithm 1, we use the technique in Zeithofer and Wess [2001] to deal with the heterogeneous register set and the register constraint. By dividing register mapping into two stages, register allocation (before scheduling) and register assignment (after scheduling), the algorithm in Zeithofer and Wess [2001] obtains the benefit, but avoids the difficulty, of considering register mapping and scheduling together. A heterogeneous register set is dealt with by transforming physical registers into virtual registers such that all virtual registers can be regarded as homogeneous. The concept of virtual registers provides a powerful way to check whether a feasible register assignment exists for a specific schedule without having to generate one. This allows the flexibility of considering the register constraint during scheduling by simply checking whether enough virtual register resources are available in each schedule step. Considerable effort is saved by determining the detailed register assignment only for the final schedule instead of for every possible intermediate schedule. The approach in Zeithofer and Wess [2001] is similar to the register class concept in Jung and Paek [2001], but with the advantage that the number of available virtual registers can be derived to constrain the mobility of each variable.

In Line 5 of Algorithm 1, the well-known maximum spanning tree (MST) algorithm [Prim 1957] is used as the maximum-cut heuristic. Variables are then allocated to the memory banks under the rule that two variables joined by a MAG edge belonging to the cut must be in different banks. Note that any heuristic maximum-cut algorithm can be used at this step; the MST algorithm is preferred because of its popularity and ease of implementation.

The WHILE loop of the algorithm is used to find a point where the maximal energy savings are achieved for the desired performance. Because of the opposing forces of parallelism and serialism, more parallelism (larger λp and smaller λs) may bring better performance and less energy savings, while more serialism (smaller λp and larger λs) may bring more energy savings but possibly deteriorated performance. Therefore, we introduce a process analogous to binary search into the algorithm, trying to reach the best tradeoff point, that is, a set of λp and λs values that achieves maximal energy savings for a given desired performance. In Step 8, if the performance is still in the desired range and more energy savings were achieved through the last change of λs, we push λs further in the direction of serialism in the hope of gaining more energy savings without degrading the performance. On the other hand, if the performance degrades too much, we move λs back toward parallelism in Step 9 in the hope of recovering the performance while maintaining the gained energy savings. Finally, when the alteration of λs becomes too small to bring any meaningful change in either performance or energy savings, the algorithm exits from the WHILE loop.
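The following sketch gives one reading of this binary-search-like coefficient adjustment (Steps 8–10 of Algorithm 1). It is our simplification, not the authors' implementation: evaluate stands in for one full pass of weight recalculation, max-cut partitioning, and scheduling, and eta, phi, and sigma are the thresholds η, φ, and σ.

```python
# A simplified sketch (one reading of Algorithm 1, Steps 8-10; not the
# authors' code). `evaluate` is a placeholder for one pass of weight
# recalculation, max-cut variable partitioning, and scheduling; it returns
# the schedule length and the longest run of consecutive idle cycles.

def tune_lambda_s(evaluate, lam_p, lam_s, eta, phi, sigma):
    lam_s_prev = lam_s
    l_min, t_max, n_stable = float("inf"), 0.0, 0
    while True:
        l_sched, t_sched = evaluate(lam_p, lam_s)
        n_stable += 1
        if t_sched - t_max > eta:
            # Step 8: idle time improved significantly; record it and push
            # lam_s toward serialism by averaging with lam_p.
            l_min, t_max = min(l_min, l_sched), max(t_sched, t_max)
            lam_s_prev, lam_s, n_stable = lam_s, (lam_p + lam_s) / 2, 0
        elif l_sched - l_min > phi:
            # Step 9: schedule became too long; back off toward the previous
            # lam_s (our interpretation of the lam_s' bookkeeping).
            lam_s, lam_s_prev, n_stable = (lam_s + lam_s_prev) / 2, lam_s, 0
        if n_stable >= sigma:
            # Step 10: no meaningful change for sigma iterations; stop.
            return lam_s
```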





Fig. 4. An example illustrating the result of applying Algorithm 1. (a) The original assembly code given by the SPAM compiler. (b) The assembly code after applying Algorithm 1.

The complexity of the algorithm depends on two factors: the schedule length (L) and the number of variables (N). In the WHILE loop, Step 6 takes O(L²) time, while the MST algorithm can be done in O(N² log N), since there are at most O(N²) edges in a MAG. Therefore, the algorithm complexity is O(w(L² + N² log N)), where w is the number of iterations of the WHILE loop body. It is shown in the experimental section that w is normally quite small.

The algorithm can easily be extended to systems with different architectural parameters. For instance, if a system has a different register file set, the technique in Zeithofer and Wess [2001] can still be used; the only difference is the number of available virtual resources. By replacing the MST algorithm with a polynomial maximum k-cut heuristic [Goldschmidt and Hochbaum 1998], this algorithm framework can be extended to systems with more memory banks.

Figure 4 shows example assembly code, the loop segment of an FIR filter from the DSPstone benchmark suite [Zivoljnovic et al. 1994]. Figure 4(a) is the assembly code obtained from SPAM. Figure 4(b) is the result after applying Algorithm 1, shown in the instruction format of opcode, operands, and two possible parallel move fields. There are 10 nodes and 34 edges in the MAG for the FIR filter, and it takes four iterations to reach the final result. The reader is referred to Wang and Hu [2004] for the complete code example. In this example, variable y is put into the Y memory bank due to the global variable partition consideration. The loop body length is reduced from seven to four clock cycles. The improvement is achieved by increasing the program parallelism (see instruction 9 in Figure 4(b)) and by memory operation scheduling (moving the operations loading (respectively, storing) variable y to the beginning (respectively, end) of its mobility according to Lemma 3.1, and thus out of the loop body). The final code is a tradeoff between energy savings and schedule length; if performance were the only emphasis, the loop body could be further reduced to three clock cycles by moving the data array px to the Y memory bank.





Fig. 5. Assembly code size results. The results are normalized with respect to the SPAM results.

6. EXPERIMENTAL RESULTS

We have implemented our algorithm in the SPAM compiler environment to replace the simulated annealing algorithm [Sudarsanam and Malik 2000] originally used by the Princeton project (see www.ee.princeton.edu/spam/). The benchmarks used are from the DSPstone benchmark suite [Zivoljnovic et al. 1994], which contains C source code for various DSP kernels: Least Mean Square (B1), FIR (B2), N Real Update (B3), IIR Biquad (B4), Convolution (B5), N Complex Update (B6), 2-Dimensional FIR (B7), Matrix Multiplication (B8), 1st Adapted Predictor (B9), and the Tone Detector routine in ADPCM (B10). The intermediate code is generated by the SUIF front end followed by the SPAM code generation program, and then fed as input to our algorithm to obtain the optimized assembly code.

The assembly code size results are shown in Figure 5. The data include the original code size (Original), the code size generated by the constraint-graph method (SPAM), the code size generated by Zhuge et al. [2001] (Inde Graph5), and the code size generated by our algorithm (VPIS). Figure 5 reveals that the independence-graph method and our algorithm both perform better than SPAM. This improvement can be attributed to exploiting more potential memory-operation parallelism. Owing to our comprehensive graph model and judicious selection of weight coefficients, our algorithm also outperforms the independence-graph method, as Figure 5 shows. The execution time of the assembly code is correlated with the code size [Leupers and Kotte 2001], since the assembly code can be directly mapped to the schedule for the basic block. Moreover, because of the loops in the DSP benchmarks, we observe even larger improvements when comparing the execution times of the assembly code. The results are shown in Figure 6.

We compare the energy-savings results of our algorithm with those of SPAM. Results from the Inde Graph method are not included in this comparison, since it does not consider energy savings. In fact, it can be regarded as a

5 Their article used a greedy heuristic algorithm, similar to the MST algorithm, to partition the variables. Their results reported in this section were obtained with the MST algorithm for the sake of fairly comparing the graph models.





Fig. 6. Execution time of assembly code. The results are normalized with respect to the SPAM results.

Fig. 7. Percentage of low-current cycles over the total execution clock cycles.

special case of our algorithm with the restriction λp = 1, λs = 0. As a simple comparison, we examine the generated assembly code. By counting the consecutive idle cycles (which must be longer than the operating mode transition time) for the two schemes in each basic block, the absolute number of idle cycles in which the memory module can be put into low-current mode is obtained by summing all such counts over the entire procedure. The ratio of these idle cycles to the overall code size can then be calculated; the average improvement of this ratio is 19.84%. This ratio comparison gives an initial estimate of at least how much improvement the algorithm can achieve. Since loop bodies execute multiple times, a larger improvement should be expected, as the following comparison demonstrates.

As a more elaborate comparison, we have simulated the execution of the generated assembly code with Sim56000 (Motorola's DSP56000 simulator). From the run profile, all the usable idle times are added together to get the total idle cycles during which the memory module can be put into low-current mode. By dividing this total by the benchmark's total execution clock cycles, the memory energy-savings ratio is derived for each benchmark. Note that the upper bound for this ratio equals the number of memory banks. The results are shown in Figure 7.
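As an illustration of this bookkeeping (a hypothetical helper, not code from the article), the following sketch sums the usable idle windows, those longer than the mode-transition time, across banks and normalizes by the total execution cycles:

```python
# A small sketch (hypothetical helper, not from the article) of the ratio used
# in Figure 7: usable idle windows are those longer than the mode-transition
# time; their total, divided by the total execution cycles, gives the fraction
# of time the memory modules can sit in low-current mode.

def savings_ratio(idle_windows_per_bank, transition_cycles, total_cycles):
    usable = sum(
        w for bank in idle_windows_per_bank for w in bank if w > transition_cycles
    )
    return usable / total_cycles  # can exceed 1, up to the number of banks

# e.g., two banks, a 4-cycle transition overhead, and a 100-cycle run:
print(savings_ratio([[10, 3, 25], [40]], 4, 100))  # -> 0.75
```

Because each bank contributes its own idle cycles, the ratio can exceed one, which is why its upper bound equals the number of memory banks, as noted above.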





Fig. 8. Algorithm execution time.

Our algorithm achieves larger improvements for data-intensive application code (e.g., computation-intensive loop bodies) than for control-intensive code (e.g., procedure calls and procedure initialization), since data-intensive code involves more ALU operations, which operate only on the register file. Furthermore, data-intensive code is usually executed many times, as in the computation loops of DSP applications. Based on these facts, we see larger energy savings in Figure 7 than the initial estimate given above: the average improvement of our approach over SPAM is 47.55%.

One concern may be raised about the algorithm execution time because of the loop used to find the best tradeoff point. Because variable partitioning is not very sensitive to changes in the average coefficients λp and λs (a small change in these two coefficients does not change the variable partition), the algorithm can generally find the best tradeoff point in at most 20 iterations. We ran the program on a Sun Ultra Sparc 2; the algorithm execution time comparison is shown in Figure 8. In the figure, we normalize the algorithm execution time by the execution time of the simulated annealing (SA) algorithm adopted by SPAM, which is given in seconds above each benchmark. Figure 8 shows that our algorithm takes much less time than the SA-based algorithm. Moreover, with more complicated programs, the constraint graphs in SPAM become larger and each step in the annealing process takes longer. The SA algorithm's execution time increases significantly with the constraint graph size, while our algorithm, by contrast, does not have to process the large graph many times (at most 20 iterations in our experiments). Therefore, the execution time improvement becomes more pronounced for larger benchmarks.

7. CONCLUSION

A variable partitioning and instruction scheduling algorithm is proposed to exploit architectures with multiple memory banks and a heterogeneous register set. The algorithm takes into account both instruction-level parallelism and reducing system energy. A novel graph model is presented to capture





both parallelism and serialism scheduling information. With such a model, the maximum instruction-level parallelism can be exploited to improve the schedule performance. The idle intervals of the memory modules are maximized under the constraint of the schedule performance, such that the system energy is reduced by keeping memory modules in low-current mode for longer times. Experimental results demonstrate that our algorithm outperforms previous techniques.

As future work, the novel graph model presented in this article can be extended to other architectures besides multibank memory architectures, such as clustered architectures and distributed systems. One common characteristic of all these systems is the tradeoff between parallelism and serialism arising from energy savings considerations, resource constraints, and so on. For example, in a clustered architecture, it is important to balance the workload across all clusters to increase performance; on the other hand, to reduce energy consumption, it is desirable to reduce the bus communication between clusters and put some clusters in low-power mode. Therefore, a graph model that can capture all such information is essential to a good scheduler, and extending our proposed graph model to cope with other architectures is worth further research effort.

REFERENCES

AUSIELLO, G., CRESCENZI, P., GAMBOSI, G., KANN, V., MARCHETTI-SPACCAMELA, A., AND PROTASI, M. 1999. Complexity and Approximation. Springer-Verlag, Berlin, Germany.
BENINI, L., MACII, A., AND PONCINO, M. 2000. A recursive algorithm for low-power memory partitioning. In Proceedings of the International Symposium on Low Power Electronics and Design. 78–83.
CATTHOOR, F., WUYTACK, S., GREEF, E., BALASA, F., NACHTERGAELE, L., AND VANDECAPPELLE, A. 1998. Custom Memory Management Methodology—Exploration of Memory Organization for Embedded Multimedia System. Kluwer Academic Publishers, Dordrecht, The Netherlands.
CHO, J., PAEK, Y., AND WHALLEY, D. 2002. Efficient register and memory assignment for nonorthogonal architectures via graph coloring and MST algorithms. In Proceedings of the ACM Joint Conference LCTES-SCOPES (Berlin, Germany). 130–138.
DELALUZ, V., KANDEMIR, M., VIJAYKRISHNAN, N., SIVASUBRAMANIAM, A., AND IRWIN, M. J. 2001. Hardware and software techniques for controlling DRAM power modes. IEEE Trans. Comput. 50, 11 (Nov.), 1154–1173.
DELALUZ, V., SIVASUBRAMANIAM, A., KANDEMIR, M., VIJAYKRISHNAN, N., AND IRWIN, M. J. 2002. Scheduler-based DRAM energy management. In Proceedings of the 39th Conference on Design Automation. 697–702.
DESOLI, G. 1998. Instruction assignment for clustered VLIW DSP compilers: A new approach. Tech. Rep. HPL-98-13, Hewlett-Packard Company, Palo Alto, CA.
GOLDSCHMIDT, O. AND HOCHBAUM, D. S. 1998. Polynomial algorithm for the k-cut problem. In Proceedings of the 29th Annual Symposium on the Foundations of Computer Science. 444–451.
JUNG, S. AND PAEK, Y. 2001. The very portable optimizer for digital signal processors. In Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems. 84–92.
LEUPERS, R. AND KOTTE, D. 2001. Variable partitioning for dual memory bank DSPs. In Proceedings of ICASSP.
LORENZ, M., KOTTMANN, D., BASHFORD, S., LEUPERS, R., AND MARWEDEL, P. 2001. Optimized address assignment for DSPs with SIMD memory accesses. In Proceedings of the Asia South Pacific Design Automation Conference (ASP-DAC, Yokohama, Japan). 415–420.
LU, Y. H., BENINI, L., AND MICHELI, G. D. 2000. Low-power task scheduling for multiple devices. In Proceedings of the 8th International Workshop on Hardware/Software Codesign. 39–43.





LUZ, V. D. L., KANDEMIR, M., AND KOLCU, I. 2002. Automatic data migration for reducing energy consumption in multi-bank memory systems. In Proceedings of the 39th Conference on Design Automation. 213–218.
MICRON. 1999. 1MB SyncBurst SRAM data sheet. Micron Technology Inc., Boise, ID. Website: www.micron.com.
PRIM, R. 1957. Shortest connection networks and some generalizations. Bell Syst. Tech. J. 36, 6.
RAMBUS. 1999. 128/144-Mbit Direct RDRAM data sheet. Rambus Inc., Los Altos, CA. Website: www.rambus.com.
SAGHIR, M., CHOW, P., AND LEE, C. 1996. Exploiting dual data-memory banks in digital signal processors. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems. 234–243.
SUDARSANAM, A. AND MALIK, S. 2000. Simultaneous reference allocation in code generation for dual data memory bank ASIPs. ACM Trans. Des. Automat. Electron. Syst. 5, 2, 242–264.
WANG, Z. AND HU, X. S. 2004. Variable partitioning and scheduling for multiple memory banks. Tech. rep., CSE Dept., University of Notre Dame, Notre Dame, IN.
WUYTACK, S., CATTHOOR, F., JONG, G. D., AND MAN, H. D. 1999. Minimizing the required memory bandwidth in VLSI system realizations. IEEE Trans. VLSI Syst. 7, 4 (Dec.), 433–441.
ZEITHOFER, T. AND WESS, B. 2001. Integrated scheduling and register assignment for VLIW-DSP architectures. In Proceedings of the 14th Annual IEEE International ASIC/SOC Conference. 339–343.
ZHUGE, Q., XIAO, B., AND SHA, E. H.-M. 2001. Exploring variable partitioning in dual data-memory bank processors. In Proceedings of the 34th International Symposium on Microarchitecture (MICRO-34), 3rd Workshop on Media and Streaming Processors (MSP-3). 42–55.
ZIVOLJNOVIC, V., VELARDE, J., SCHAGER, C., AND MEYR, H. 1994. DSPstone—A DSP-oriented benchmarking methodology. In Proceedings of the International Conference on Signal Processing Applications and Technology.

Received June 2004; revised October 2004, December 2004; accepted December 2004


Large-Scale Circuit Placement

JASON CONG, JOSEPH R. SHINNERL, and MIN XIE
UCLA Computer Science
TIM KONG
Magma Design Automation
and
XIN YUAN
IBM Corporation, Microelectronics Division

Placement is one of the most important steps in the RTL-to-GDSII synthesis process, as it directly defines the interconnects, which have become the bottleneck in circuit and system performance in deep submicron technologies. The placement problem has been studied extensively over the past 30 years. However, recent studies show that existing placement solutions are surprisingly far from optimal. The first part of this tutorial summarizes results from recent optimality and scalability studies of existing placement tools. These studies show that the results of leading placement tools from both industry and academia may be as much as 50% to 150% away from optimal in total wirelength. If such a gap can be closed, the corresponding performance improvement will be equivalent to several technology-generation advancements. The second part of the tutorial highlights recent progress on large-scale circuit placement, including techniques for wirelength minimization, routability optimization, and performance optimization.

Categories and Subject Descriptors: B.7.2 [Integrated Circuits]: Design Aids—Placement and routing; G.4 [Mathematical Software]: Algorithm design and analysis; J.6 [Computer-Aided Engineering]: Computer-aided design (CAD)

General Terms: Algorithms, Design

Additional Key Words and Phrases: Placement, optimality, scalability, large-scale optimization

1. INTRODUCTION

This work was funded by the Semiconductor Research Consortium, Contracts 2003-TJ-1091 and 99-TJ-686; the National Science Foundation, Grants CCR-0096383 and CCF-0430077; and Magma Corporation and Xilinx Corporation under the California MICRO Program.
Authors' addresses: J. Cong, J. R. Shinnerl, and M. Xie, UCLA Computer Science Department, Campus Mailcode 159610, Los Angeles, CA 90095-1596; email: {cong,shinnerl,xie}@ca.ucla.edu; T. Kong, Magma Design Automation, 12100 Wilshire Blvd., Suite 480, Los Angeles, CA 90025; email: [email protected]; X. Yuan, IBM Corporation, Microelectronics Division, 1000 River Street, Mail Stop 862F, Essex Junction, VT 05452; email: [email protected].

The exponential growth of on-chip complexity has dramatically increased the demand for scalable optimization algorithms for large-scale physical design.





Although complex logic functions can be composed hierarchically following the logical hierarchy, recent studies [Cong 2001] show the importance of building a good physical hierarchy, from a flattened or nearly flattened logical netlist, for performance optimization. Because a logical hierarchy is usually conceived with little or no consideration of layout and interconnect information, it may not map well to a two-dimensional layout solution. Therefore, large-scale global placement on a nearly flattened netlist is needed for physical hierarchy generation to achieve the best performance. This approach is even more important in today's nanometer designs, where the interconnect has become the performance bottleneck.

This tutorial highlights state-of-the-art placement optimization techniques. Section 2 presents recent studies on the quality and scalability of existing placement algorithms on a set of benchmarks with known optimal solutions. Section 3 reviews scalable paradigms for large-scale wirelength minimization. Timing optimization and routability optimization are discussed in Sections 4 and 5, respectively. Conclusions are given in Section 6.

2. GAP ANALYSIS OF EXISTING PLACEMENT ALGORITHMS

Placement algorithms have been actively studied for the past 30 years. However, there is little understanding of how far their solutions are from optimal, and it is not known how much the deviation from optimality is likely to grow with problem size. Recently, significant progress was made using cleverly constructed placement examples with known optimal wirelength [Hagen et al. 1995; Chang et al. 2003b]. In this section, we summarize the results of these studies.

2.1 Placement Examples with Known Optima

Four suites of placement examples with known optimal wirelength (PEKO) were constructed [Hagen et al. 1995; Chang et al. 2003b]. The construction method takes as input an integer n and a net-profile vector of integers D. It then generates a placement example P with n placeable modules such that (i) the number of nets of degree i equals D(i), and (ii) P has a known globally optimal half-perimeter wirelength. The values of n and D used to construct PEKO either were directly extracted from the netlists of the ISPD98 suite originally from IBM [Alpert 1998] or were taken as those values scaled by a factor of 10. The PEKO suite is given in both GSRC BookShelf format and LEF/DEF format and is available online [Cong et al. 2004].

All the nets in PEKO are local; that is, the wirelength of every net has the minimum possible value. However, in real circuits there may also be global connections that span a significant portion of the chip, even when they are optimally placed. Additional benchmark circuits were therefore constructed to study the impact of global nets [Cong et al. 2003b]. Circuits in the G-PEKU suite consist only of global nets connecting either an entire row or an entire column. For such circuits, an obvious upper bound on the optimal wirelength is the sum of the lengths of the rows and columns. Circuits in the PEKU suite (Placement





Fig. 1. Average solution quality vs. percentage of nonlocal nets, from PEKO (0% nonlocal nets) through PEKU (0.25% to 10% nonlocal nets) to G-PEKU (100% nonlocal nets). Each data point is an average quality ratio for a given placer over all circuits in the given suite.

Examples with Known Upper bounds on wirelength) consist of both PEKO-style local nets and additional, randomly generated nonlocal nets. An upper bound on the optimal wirelength is derived simply by adding the wirelengths of the nonlocal nets to the known total wirelength of the local nets. In this study [Cong et al. 2003b], the percentage of nonlocal nets was gradually increased from 0.25% to 10%. The G-PEKU and PEKU suites are also available online [Cong et al. 2004].

2.2 Gap Analysis Results

Four state-of-the-art placers from academia and one industrial placer were studied for optimality and scalability: Dragon v.2.20 [Wang et al. 2000], Capo v.8.5 [Caldwell et al. 2000b], mPL v.2.0 [Chan et al. 2003b], mPG v.1.0 [Chang et al. 2003a], and QPlace v.5.1.55 [Cadence Design Systems, Inc. 1999]. Experiments with Dragon, mPL, mPG, and QPlace were performed on a 750-MHz Sun Blade running SunOS 5.8 with 4 GB of memory; the experiments with Capo were performed on a 2.20-GHz Pentium IV running RedHat 8.0 with 2 GB of memory. To measure how close the placement results are to optimal, the ratio of a placement's wirelength to the optimal wirelength (on PEKO) or its upper bound (on G-PEKU and PEKU) was computed. This ratio is called the "quality ratio." An upper limit of 24 hours was placed on the run time; any process exceeding this limit was terminated.

The results are summarized in Figure 1 and Figure 2. Figure 1 shows how the average quality ratios of these tools change with the percentage of nonlocal nets. Figure 2 shows how the run times of these tools change with increasing





Fig. 2. Run time vs. cell number for several algorithms on the PEKO suite.

cell number. We make the following observations:

(i) None of the placers from the 2003 study achieves a quality ratio close to one.1 On PEKO, the wirelengths produced by these tools range from 1.41 to 2.09 times the optimal on average (see Figure 1) and from 1.66 to 2.50 times the optimal in the worst case (not shown). On G-PEKU, the gap between their solutions and the upper bound varies between 79% and 102% in the worst case. Some placers may try to improve routability by sacrificing wirelength; however, given the gap between their wirelengths and the optimal value, there remains significant room for improvement in existing placement algorithms.

(ii) The quality ratio of the same placer can vary significantly for designs of similar sizes but different characteristics, and no placer produces consistently better results than the others. On PEKO, mPL gives the shortest wirelength; however, its quality ratio increases by more than 40% with a small increase in nonlocal nets. On G-PEKU, Capo gives the solution closest to the upper bound in most cases. On PEKU, Dragon's wirelength gradually becomes the closest to the upper bound. This seems to suggest that more scalable and stable hybrid techniques may be needed for future generations of placement tools.

(iii) Different placers display different scalability in run time and solution quality. None of them successfully finished all the circuits of PEKO, because of either the run-time limit (e.g., Dragon) or memory consumption (e.g., Capo, mPL, mPG, QPlace). For the circuits they did place successfully, an average solution-quality deterioration from 4% (for QPlace) to 25% (for mPL) can be observed when the problem size is increased by a factor of 10.

1 However, recently improved versions of mPL and Feng Shui (not included in the 2003 study) have consistently obtained average quality ratios below 1.3, and other placers have also reported significant gains.
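For readers new to these metrics, the following sketch (ours, not from the tutorial) spells out the two quantities used throughout this section: the half-perimeter wirelength (HPWL) of a placement and the quality ratio against a known optimum or upper bound; the data layout is an assumption for illustration.

```python
# A minimal sketch (ours, not from the tutorial) of half-perimeter wirelength
# (HPWL) and the quality ratio. cells maps a cell name to its (x, y) location;
# nets is a list of lists of cell names (a hypothetical layout).

def hpwl(cells, nets):
    total = 0.0
    for net in nets:
        xs = [cells[c][0] for c in net]
        ys = [cells[c][1] for c in net]
        # half-perimeter of the net's bounding box
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def quality_ratio(placed_wl, optimal_wl):
    return placed_wl / optimal_wl  # 1.0 means provably optimal, as on PEKO

cells = {"a": (0, 0), "b": (2, 1), "c": (1, 3)}
print(hpwl(cells, [["a", "b"], ["a", "b", "c"]]))  # -> 3.0 + 5.0 = 8.0
```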





It is not known whether the gaps on real circuits are similar to those observed on the benchmarks discussed above. A recent study [Wang et al. 2005] computes lower bounds for the optimal half-perimeter wirelengths of some widely used FPGA benchmarks and observes ratio gaps between 1.14 and 4.09 for placements computed by VPR [Betz and Rose 1997]. The construction of placement examples that resemble real circuits more closely, including examples optimized for timing [Cong et al. 2003a] or routability, is an active area of research.

3. SCALABLE PARADIGMS

Scalability typically derives from some hierarchical form of computation. The use of hierarchy may be subtle or indirect, but it is rarely completely absent. In this article, we use the word "scalability" in the practical, operational sense and therefore consider not just O(N) algorithms but rather any framework likely to have applicability lasting for several technology generations and circuit-size ranges.

Wirelength, performance, power consumption, and routability are the typical objectives of VLSI placement. Of these, weighted total wirelength is a useful single representative, as (i) it can be optimized efficiently, and (ii) strategic, iterative net reweighting can be used to optimize other objectives, such as performance and routability. Our discussion is therefore centered on methods for wirelength-driven global placement. The goal here is only an approximately uniform distribution of cells with as little total wirelength as possible; the problem of transforming a global placement into an overlap-free configuration is left to the detailed placement phase.

The most promising large-scale approaches to wirelength-driven global placement can be broadly categorized by (i) the manner in which their hierarchies are constructed and traversed, and (ii) the kinds of intralevel optimizations used and the manner in which they are incorporated into the hierarchy and coordinated with each other. At the highest level, we classify algorithms as top-down, multilevel, or flat; in practice, however, these categories overlap in interesting ways. Top-down algorithms (Section 3.1) use variants of recursive partitioning. Multilevel approaches (Section 3.2) compute placements of aggregates at several distinct levels of aggregation; these levels are most commonly formed by recursive clustering but may instead be defined by top-down partitioning. Flat approaches (Section 3.3), if scalable, use hierarchy for internal iterative computation while maintaining a consistent nonhierarchical view of the placement problem.

3.1 Recursive Top-Down Partitioning

Among academic placement tools, all the leading top-down methods rely on variants of recursive circuit partitioning. Seminal work on partitioning-based placement was done by Breuer [1977] and Dunlop and Kernighan [1985]. Most contemporary methods, including Capo [Caldwell et al. 2000b] and Feng Shui [Yildiz and Madden 2001a], have exploited further advances in fast algorithms for hypergraph partitioning to push these frameworks beyond their original capabilities. Fast, high-quality O(N) partitioning algorithms give

Fig. 3. Cutsize-driven partitioning-based placement. Rectangles represent movable cells, line segments represent cutlines, and ellipses and other closed curves represent nets. The recursive bipartitioning attempts to minimize the number of nets containing cells in more than one subregion.

Fast, high-quality O(N) partitioning algorithms give top-down partitioning attractive O(N log N) scalability overall. The asymptotic is O(N log N) rather than O(N) because partitioning is always applied to individual cells, not to aggregates.

3.1.1 Cutsize Minimization. Simple, traditional recursive bisection with a cutsize objective can be used quite effectively with simple Fiduccia-Mattheyses-style iterations. At a given level, each region is considered separately from the others in some arbitrary order. A spatial cutline for the region, either horizontal or vertical, can be carefully chosen. Given some initial partition, subsets of cells are moved across the cutline in a way that reduces the total weight of hyperedges cut without violating a given area-balance constraint. This constraint can be set loosely initially and then gradually tightened. As the recursion proceeds, cell subsets become smaller, and the cell-area distribution over the placement region becomes more uniform. Base cases of the bipartitioning recursion are reached when cell subsets become small enough that special end-case placers can be applied [Caldwell et al. 2000d]. A small example after 3 levels of bipartitioning is illustrated in Figure 3, and a schematic sketch of the recursion appears below.

Connections between subregions can be modeled by terminal propagation [Dunlop and Kernighan 1985], in which the usual cutsize objective is augmented by terms incorporating the effect of connections to external subregions. Other techniques for organizing local partitioning subproblems use Rent's rule to relate cutsize to wirelength estimation [Wang et al. 2000; Yildiz and Madden 2001b]. Careful consideration of the order and manner in which subregions are selected for partitioning can be significant. For example, a dynamic-programming approach to cutline selection can improve overall results by 5% or more [Yildiz and Madden 2001b]. In the multiway partitioning framework, intermediate results from the partitioning of each subregion are used to influence the final partitioning of others. Explicit use of multiway partitioning at each stage can in some cases bring the configuration closer to a global optimum than is possible by recursive bisection alone [Yildiz and Madden 2001a].
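As a concrete illustration, the following is a minimal sketch of the top-down recursion described at the start of this subsection. It is our illustration, not any tool's actual code: `bipartition` stands in for a min-cut hypergraph partitioner, `region.split()` for cutline selection, and `end_case_place` for a special end-case placer.

```python
# A minimal sketch of cutsize-driven recursive bisection placement.
# All helper names are invented stand-ins, not a real placement API.

def place_by_bisection(cells, region, bipartition, end_case_place, min_cells=35):
    """Recursively bipartition `cells` and `region` until end cases are reached."""
    if len(cells) <= min_cells:
        return end_case_place(cells, region)   # e.g., branch-and-bound end case
    left, right = bipartition(cells)           # minimize cut nets, balance area
    sub_a, sub_b = region.split()              # horizontal or vertical cutline
    placement = {}
    placement.update(place_by_bisection(left, sub_a, bipartition,
                                        end_case_place, min_cells))
    placement.update(place_by_bisection(right, sub_b, bipartition,
                                        end_case_place, min_cells))
    return placement
```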

Cell replication and iterative deletion have been used for this purpose. Rather than attempt to find the best subregion in which to place a cell, we can replicate the cell enough times to place it once in every subregion, then iteratively delete only the worst choices. These iterations may continue until only one choice remains, or they may be terminated earlier, allowing a small pool of candidates to be propagated to and replicated at finer levels. By postponing further deletion decisions until better information becomes available, spurious effects from locally optimal subregion partitions can be diminished and the global result improved.

Example: CAPO. To provide a concrete example of an implementation of fixed-die placement by top-down, cutsize-driven recursive bipartitioning, we briefly describe the CAPO package [Caldwell et al. 2000b] and a few of its extensions. The stated goals of CAPO are simplicity and automatic routability. In addition, the top-down flow and the use of the leading multilevel hypergraph partitioner MLpart [Caldwell et al. 2000a] give CAPO superior speed and scalability. For simplicity, no explicit congestion management is used. Instead, decisions affecting the flow of the top-down recursive bipartitioning are carefully considered for their impact on routability.

In CAPO, recursive cutsize-driven hypergraph-netlist bipartitioning is enhanced to support its ultimate goal of wirelength-driven circuit placement. Key considerations include nonuniformity of vertex weights, assignment of partition blocks to rectangular placement subregions, efficient solution of partitioning subproblems with small balance tolerances, and connections between cells in the subregion currently being partitioned and other, "external" subregions (terminal propagation). Movable cells in the circuit correspond to vertices in the partitioning instance. Vertex weights are determined by the corresponding cell areas. Given a hypergraph netlist and a placement region or subregion, the multilevel partitioner MLpart [Caldwell et al. 2000a] is used to divide the set of movable cells into two subsets of nearly equal total area when the number of cells exceeds 200. For fewer than 200 cells, enhanced Fiduccia-Mattheyses (FM) partitioning [Fiduccia and Mattheyses 1982; Caldwell et al. 2000c] is used directly (MLpart is also based on this enhanced version of FM).

The region must then be split, either vertically or horizontally, so that the resulting subregions hold the partition blocks and whitespace is distributed as evenly as possible. CAPO uses a horizontal cut if the number of standard-cell rows contained in the subregion exceeds M/15, where M is the number of movable cells in the subregion; otherwise, it chooses the cut direction that makes the aspect ratios of the resulting subregions as close to one as possible (this rule is sketched below). For a vertical cut, whitespace can be distributed perfectly evenly, but horizontal cuts are constrained to lie between uniform-height rows of the standard-cell layout. Respecting standard-cell row boundaries in this fashion greatly facilitates legalization of the final global placement. However, a recent study shows that this restriction occasionally overconstrains end cases and increases wirelength [Agnihotri et al. 2003]. The authors of this study show that Feng Shui's "fractional-cut" relaxation of row boundaries during partitioning can considerably improve results when it is followed by careful displacement-minimizing legalization, such as dynamic-programming-based row assignment.
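The cut-direction rule just described admits a compact sketch. The function below is our illustration (not CAPO's code), with all names invented.

```python
# A small sketch of CAPO's cutline-direction rule: cut horizontally when the
# subregion contains enough standard-cell rows, otherwise pick the direction
# that keeps the child subregions' aspect ratios close to one.

def choose_cut_direction(num_rows, num_cells, width, height):
    """Return 'horizontal' or 'vertical' for the next cutline."""
    if num_rows > num_cells / 15.0:
        return "horizontal"            # plenty of rows: cut between cell rows
    # Otherwise split the longer side so children stay close to square.
    return "vertical" if width >= height else "horizontal"
```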

Once the number of cells to be placed in any subregion decreases below 35, CAPO employs time-limited branch-and-bound heuristics to obtain an optimal or nearly optimal end-case solution [Caldwell et al. 2000d]. Finally, greedy refinement of cell orientations further improves wirelength and routability.

Much of CAPO's performance derives from its placement-driven enhancements to its core FM partitioner. These enhancements are concerned mainly with (i) avoiding the "corking effect," described below, caused by improper handling of large variations in movable-cell areas and/or tight area-balance constraints, and (ii) terminal propagation. Given any initial partition, FM considers sequences of single, maximum-gain cell moves from one partition block to the other. It maintains a list of "buckets" for each partition block, where the kth bucket in each list holds the vertices which, when moved to the opposite block, will reduce the total number of nets cut by k. However, a cell will not be moved if the move violates the vertex-weight balance constraint. The original version of FM [Fiduccia and Mattheyses 1982] is focused primarily on hypergraph instances with all vertex weights equal; it does not specify any ordering of vertices within buckets. According to the partitioning studies on which CAPO is based [Caldwell et al. 2000c], many leading implementations of FM-based partitioning, in order to reduce run time, simply terminate searches in gain buckets when the first vertex in the kth bucket cannot be moved without violating the vertex-weight balance constraint. This shortcut has dire consequences, however, when cell sizes vary widely: if a cell too large to be moved occurs at the front of a bucket, it prevents any of the other moves in the bucket from being examined. By avoiding this "corking" effect, CAPO significantly improves its results [Hagen et al. 1997].

CAPO prevents corking in two ways. First, gain buckets are maintained as last-in, first-out (LIFO) stacks. Second, and most importantly, it starts each bipartitioning subproblem with a relaxed area-balance constraint and gradually tightens the constraint as partitioning iterations proceed. Initially, the maximum allowed area imbalance is set to 20% of total area or three times the area of the largest movable cell, whichever is greater. As the balance tolerance decreases below the area of any cell, that cell is locked in its current partition block. The final subproblem balance tolerance is selected so that, given an initial whitespace budget, enough relative whitespace in end-case subproblems is ensured that overlap-free configurations can typically be found.

In CAPO's original implementation, terminal propagation is simple. A given subproblem consists of (i) a pair of adjacent rectangular subregions, say A and B, (ii) a collection of movable cells S to be divided between A and B, and (iii) a list of nets, each of which contains cells in S and, possibly, other external cells in other subregions as well. A hyperedge that contains some external cells in subregions closer to A and other external cells in subregions closer to B will necessarily be cut and is therefore ignored. A hyperedge whose external cells (those not in S) are all closer to A than to B is treated as if its external cells were all fixed at A's center; similarly, a hyperedge whose external cells are all closer to B than to A is treated as if its external cells were all fixed at B's center.
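The simple terminal-propagation rule just described can be sketched as follows. This is our illustration with invented names: `nets` is a list of cell sets, `in_S` the set of movable cells of the subproblem, and `closer_to_A` a predicate on external cells.

```python
# A small sketch of simple terminal propagation: classify each net by where
# its external cells lie, ignoring nets that are unavoidably cut.

def propagate_terminals(nets, in_S, closer_to_A):
    """Return (net, anchor) pairs, where anchor is 'A', 'B', or None."""
    result = []
    for net in nets:
        external = [c for c in net if c not in in_S]
        if not external:
            result.append((net, None))        # purely internal: no anchor
            continue
        sides = {'A' if closer_to_A(c) else 'B' for c in external}
        if len(sides) == 2:
            continue                          # necessarily cut: ignore the net
        # Treat all externals as fixed at the center of subregion A or B.
        result.append((net, sides.pop()))
    return result
```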
Although this simple strategy is effective, recent studies [Yildiz and Madden 2001a; Kahng and Reda 2004] demonstrate significant improvement by iterative refinement. Initially, many nets contain ambiguous terminals, that is, external cells whose locations are not yet known precisely enough to determine which of subregions A and B they are closer to. After a complete sweep of bipartitioning, however, many of these ambiguous terminals become unambiguous relative to given subproblems. Repartitioning such subproblems with the improved terminal propagation leads to improved average results: for example, a 5% cutsize reduction after 3 complete sweeps at each level [Kahng and Reda 2004].

3.1.2 Incorporating Advances in Floorplanning. Recent improvements to Capo include the incorporation of fixed-outline floorplanning to improve the handling of large macro blocks in mixed-size placement [Adya et al. 2004]. Min-cut placement proceeds as described above until certain ad hoc tests suggest that legalization of a subset of blocks and cells within their assigned subregion may be difficult. At that point, the cells in that subregion are aggregated into soft clusters, and annealing-based fixed-outline floorplanning is applied to the given subproblem [Adya and Markov 2002]. If it succeeds, the macro locations in its solution are fixed. If it fails, the subproblem must be merged with its sibling subproblem, and the merged parent subproblem must then be floorplanned. This step therefore initiates a recursive backtracking through ever-larger ancestor subproblems. The backtracking terminates as soon as one of these ancestor subproblems is successfully floorplanned. The ad hoc tests are chosen to prevent long backtracking sequences on most test cases, as the floorplanner does not scale well to large subproblems. Adya et al. [2004] observe that it is typically possible to define the ad hoc tests so that the transition from min-cut partitioning to fixed-outline floorplanning does not impair scalability. However, as the algorithm cannot ensure the legalizability of the subproblems it generates by min-cut partitioning, it cannot prevent the possibility of a long backtracking sequence or a failure, especially on difficult low-whitespace instances.

The general challenge of ensuring the legalizability of subproblems within min-cut-partitioning-based floorplanning or placement has been addressed by Patoma [Cong et al. 2005a, 2005b]. Beginning with the given instance itself, Patoma employs fast and scalable area-driven floorplanning before cutsize-driven partitioning in order to confirm that the problem can be legalized as given. This area-driven "prelegalization" ignores wirelength but serves as a guarantor of the legalizability of subsequent steps. It is extremely robust; no failure on any known public-domain benchmark circuit has been observed. Given the guarantor legalization at a given level, cutsize-driven partitioning proceeds at that level. The flow then proceeds recursively on the subproblems generated by the cutsize-driven partitioning, each subproblem being legalized before it is solved. When prelegalization fails, the failed subproblem is merged with its sibling, and the previously computed legal guarantor solution to this parent subproblem is improved to reduce wirelength. The flow thus guarantees the computation of a legal placement or floorplan, under the very modest assumption that the initial attempt to prelegalize the given instance succeeds.

Experiments with this flow demonstrate significantly more robust performance on mixed-size benchmarks with whitespace between 1 and 10%.

3.1.3 Partitions Guided by Analytical Placements. An oft-cited disadvantage of recursive bisection is its alleged tendency to ignore the global objective as it pursues locally optimal partitions. Approximating wirelength by cutsize in the objective may also degrade the quality of the final placement. A radically different approach, first introduced in Proud [Cheng and Kuh 1984; Tsay et al. 1988] and subsequently refined by Gordian [Kleinhans et al. 1991], is to use continuous, iteratively constrained quadratic star-model wirelength minimization over the entire circuit to guide partitioning decisions. The choice of a quadratic-wirelength objective helps avoid long wires and facilitates the construction of efficient numerical linear-system solvers for the optimality conditions, for example, preconditioned conjugate gradients. I/O pads prevent the cells from simply collapsing to a single point. Linear wirelength can still be asymptotically approximated by iterative adjustments to the net weights [Sigl et al. 1991]. Following this "analytical" placement, each region is then quadrisected, and cells are partitioned to subregions in order to further reduce overlap and area congestion. In Gordian, carefully chosen cutlines and FM-based cutsize-driven partitioning and repartitioning are used. Cell-to-subregion assignments are loosely enforced by imposing and maintaining a single center-of-mass equality constraint for each subregion. As constraints accumulate geometrically, degrees of freedom in cell movement are eliminated, and the quadratic minimization at each step moves cells less and less.

Example: BonnPlace. BonnPlace [Vygen 1997; Brenner and Rohe 2002] is the leading contemporary variant of placement by top-down recursive partitioning guided by analytical quadratic-wirelength minimization. Global unconstrained minimization of quadratic wirelength (cf. (2) below) determines a starting configuration; the presence of fixed I/O pads, typically along the circuit boundary, prevents the movable cells from collapsing to a single point. The cells are then partitioned into four disjoint subregions, in linear time, in a manner that essentially minimizes the sum of their rectilinear displacements from their starting positions [Vygen 2000]. BonnPlace does not explicitly impose equality constraints in the subsequent analytical minimization to preserve these partitioning assignments, as Gordian does. Instead, it directly alters the quadratic-wirelength objective to minimize the sum of all cells' displacements from their assigned subregions. The following four steps are then repeated until subregions become small enough that legalization and detailed placement by local perturbations can be applied [Brenner et al. 2004].

(1) For each pair of connected cells (clique model) not in the same subregion, express their contributions to the global objective as squared Manhattan distances to their respective subregion boundaries, in the direction of the segment joining them.
(2) Minimize this quadratic objective over the entire chip simultaneously (not over subregions separately in sequence). By moving all cells in all regions at the same time, a higher-quality solution can be obtained.

(3) Quadrisect each subregion by displacement-minimizing partitioning.
(4) Suppose there are now n² = n × n subregions. FOR EACH of the (n − 1)² (overlapping) 2 × 2 windows of subregions:
(4.1) Perform quadratic wirelength minimization separately over just the cells in the current 2 × 2 window, allowing these cells to move anywhere within this window but not into other windows.
(4.2) IF the center of mass of the result of (4.1) can be maintained by some overlap-free placement of cells within the current window, THEN repartition the cells in the current window into the four contained subregions; ELSE redo (4.1) subject to a center-of-mass equality constraint, then repartition.
(4.3) Do QP minimization over the current window with an updated objective respecting the partitioning assignments from (4.2).
END FOR

3.1.4 Iterative Refinement. Following the initial partitioning at a given level, various means of further improving the result at that level can be used. In BonnPlace (Section 3.1.3), unconstrained quadratic-wirelength minimization over 2 × 2 windows of subregions is followed by a repartitioning of the cells in these windows. Windows can be selected based on routing-congestion estimates. Capo [Caldwell et al. 2000b] greedily selects cell orientations in order to reduce wirelength and improve routability. Feng Shui [Yildiz and Madden 2001a] follows k-way partitioning by localized repartitioning of each subregion. Some leading partitioning-based placers also employ time-limited branch-and-bound-based enumeration at the finest levels [Caldwell et al. 2000d].

In Dragon [Wang et al. 2000; Sarrafzadeh et al. 2002], an initial cutsize-minimizing quadrisection is followed by a bin-swapping-based refinement, in which entire partition blocks at that level are interchanged in an effort to reduce total wirelength. At all levels except the last, low-temperature simulated annealing is used; at the finest level, a more detailed and greedy strategy is employed. Because the refinement is performed on aggregates of cells rather than on cells from the original netlist, Dragon may also be grouped with the multilevel methods discussed next.

3.2 Multilevel Methods

Placement algorithms in the multilevel paradigm have only recently drawn attention [Sankar and Rose 1999; Chan et al. 2000; Chang et al. 2003a; Chan et al. 2003a, 2003b]. These methods are based on coarsening, relaxation, and interpolation, defined as follows (a schematic skeleton combining the three appears below).

(i) Coarsening. Hierarchies are built either from the bottom up by recursive aggregation or from the top down by recursive partitioning.
(ii) Relaxation. Localized optimizations are performed at every aggregation level.
(iii) Interpolation. Intermediate solutions are transferred from each aggregation level to its adjacent finer level.
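The skeleton below is a schematic of a single V-cycle under these three ingredients. It is our sketch under stated assumptions, not any tool's code: `cluster`, `relax`, `interpolate`, and `initial_place` are stand-ins for a coarsening rule, an intralevel optimizer, a level-to-level transfer, and a coarsest-level solver, and levels are abstract collections of aggregates.

```python
# A schematic skeleton of one multilevel flow (a single V-cycle).
# All helper callables are invented stand-ins.

def v_cycle(netlist, cluster, relax, interpolate, initial_place, coarsest=500):
    levels = [netlist]
    while len(levels[-1]) > coarsest:                # (i) recursive coarsening
        levels.append(cluster(levels[-1]))
    placement = initial_place(levels[-1])            # solve the coarsest level
    for fine, coarse in zip(reversed(levels[:-1]), reversed(levels[1:])):
        placement = interpolate(placement, coarse, fine)  # (iii) interpolation
        placement = relax(fine, placement)                # (ii) relaxation
    return placement
```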

Additionally, the order in which the various problems at the various levels are solved can be important. The simplest and most common approach is simply to proceed top down, from the coarsest to the finest level, once the aggregation hierarchy has been constructed [Sankar and Rose 1999; Chan et al. 2000; Sarrafzadeh et al. 2002; Kahng and Wang 2004]. When the hierarchy is defined by recursive bottom-up clustering, the combined flow of recursive clustering followed by recursive top-down optimization and interpolation is traditionally referred to as a "V-cycle" (Figure 5; the bottom of the 'V' corresponds to the coarsest or "top" level of the hierarchy). However, studies show that considerable improvement is possible by repeated traversals and reconstructions of the hierarchy in various orderings [Brandt and Ron 2002; Chan et al. 2003b], as in traditional multiscale methods for PDEs [Briggs et al. 2000]. We refer to the organization of these traversals as iteration flow.

The scalability of the multilevel approach is straightforward to obtain and understand. Provided relaxation at each level has order linear in the number $N_a$ of aggregates at that level, and the number of aggregates per level decreases by a factor $r < 1$ at each level of coarsening, say $N_a(i) = r^i N$ at level $i$, the total order of a multilevel method is at most $cN(1 + r + r^2 + \cdots) = cN/(1 - r)$; with $r = 1/4$, for instance, the bound is $\tfrac{4}{3}cN$. Higher-order (nonlinear) relaxations can still be used if their use is limited to subsets of bounded size, for example, by sweeps over overlapping windows of contiguous clusters at the current aggregation level.

3.2.1 Coarsening. A hierarchy of problem formulations can be defined either from the bottom up by recursive aggregation or from the top down by recursive partitioning. Traditional multiscale algorithms form their hierarchies by recursive clustering or generalizations thereof. However, the importance of limiting cutsize makes partitioning attractive in the placement context [Sarrafzadeh et al. 2002; Kahng and Wang 2004]. Typically, clustering algorithms merge tightly connected cells in a way that eliminates as many nets at the adjacent coarser level as possible while respecting some area-balance constraints. Experiments to date suggest that relatively simple, graph-based greedy strategies like First-Choice vertex matching [Karypis 1999, 2003] may be more effective than more sophisticated ideas like edge-separability clustering (ESC) [Cong and Lim 2000] that attempt to incorporate estimates of global connectivity information. How best to define coarse-level hyperedges without explosive growth in the number and degree of coarsened hyperedges relative to coarsened vertices remains an important open question [Hu and Marek-Sadowska 2004].

First-Choice clustering is the method currently used by mPL [Chan et al. 2000, 2003a, 2003b] and mPG [Chang et al. 2003a]. A graph is defined on the netlist vertices, with each edge weighted by the "affinity" of the given two vertices. The affinity may represent some weighted combination of complex objectives, such as hypergraph connectivity, spatial proximity, timing delay, area balance, coarse-level hyperedge elimination, and so on. Each vertex is paired with some other vertex for which it has its highest affinity; a small sketch of this matching follows.
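The sketch below is our illustration of First-Choice matching and the resulting clusters (compare Figure 4); the toy `affinity` dict-of-dicts representation is invented for this example.

```python
# First-Choice clustering sketch: each vertex marks an edge to a
# highest-affinity neighbor; connected components of the marked subgraph
# become the next-level clusters.

def first_choice_clusters(affinity):
    marked = set()
    for u, nbrs in affinity.items():
        if nbrs:
            v = max(nbrs, key=nbrs.get)          # u's highest-affinity neighbor
            marked.add(frozenset((u, v)))        # pairing need not be symmetric
    # Union-find over the marked edges yields the clusters.
    parent = {u: u for u in affinity}
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    for edge in marked:
        a, b = tuple(edge)
        parent[find(a)] = find(b)
    clusters = {}
    for u in affinity:
        clusters.setdefault(find(u), []).append(u)
    return list(clusters.values())

# Toy data echoing Figure 4: d's best neighbor is b, but b's is f.
aff = {'a': {'b': 2.0}, 'b': {'a': 2.0, 'f': 3.0}, 'd': {'b': 1.0}, 'f': {'b': 3.0}}
print(first_choice_clusters(aff))   # one cluster containing a, b, d, f
```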

Fig. 4. First-Choice clustering on an affinity graph. Darkened edges in the original graph are of maximal weight for at least one of their vertices. Note that vertex d has maximal affinity for vertex b, but vertex b has maximal affinity for vertex f.

This maximum-affinity pairing is not symmetric and is independent of the order in which vertices are considered (see Figure 4). The corresponding maximum-affinity edges are marked and define a subgraph of the affinity graph; connected components of this subgraph are clustered and thus define the vertices at the next coarser level.

A common objection to clustering is that its associations may be incorrect and may therefore lead subsequent iterations to the wrong part of the solution space. To reduce the likelihood of poorly chosen clusters, the notion of a cluster can be generalized by weighted aggregation. Rather than assign each cell to just one cluster, we can break it into a small number of weighted fragments and assign the fragments to different coarse-level vertices; these are no longer simple clusters and are instead called aggregates. During interpolation, a cell's initial, inherited position is then typically determined by that of several aggregates as a weighted average [Chan et al. 2003b]. Clustering, also called strict aggregation, is a special case of weighted aggregation. Both are associated with algebraic multigrid (AMG) methods [Brandt 1986; Briggs et al. 2000] for the hierarchical numerical solution of PDEs over unstructured discretizations.

3.2.2 Initial Placement at Coarsest Level. A placement at the coarsest aggregate level may be derived in various ways. Because the initial placement may have a large influence on subsequent iterations, and because the coarsest-level problem is relatively small, the placement at this level is typically performed with great care, to the highest quality possible. mPL [Chan et al. 2000, 2003a, 2003b] uses nonlinear programming (Section 3.2 below); mPG uses simulated annealing [Chang et al. 2003a]. How to judge the coarse-level placement quality is not necessarily obvious, however, as the coarse-level objective may not correlate strictly with the ultimate fine-level objectives. For this reason, multiple iterations over the entire hierarchical flow are important [Brandt and Ron 2002; Chan et al. 2003b].

3.2.3 Relaxations. Relaxations at a given level are fast and relatively localized. The global view comes from the multilevel hierarchy, not from the intralevel relaxations. Almost any algorithm can be used, provided that it can support (i) incorporation of complex constraints and (ii) restriction to subsets of movable objects. Relaxation in mPG and Ultrafast VPR is by fast annealing.

The mPG framework employs a fixed set of hierarchical bin-density constraints to monitor area and routing congestion. In mPL, relaxation at intermediate levels proceeds both by (i) quadratic wirelength minimization on small subsets followed by path-based area-congestion relief [Hur and Lillis 2000] and by (ii) randomized, greedy, discrete Goto-based cell swapping [Goto 1981].

3.2.4 Interpolation. Simple declustering and linear assignment can be effective [Chan et al. 2000]. With this approach, each component cluster is initially placed at the center of its parent's location. If an overlap-free configuration is needed, a uniform bin grid can be laid down, and clusters can be assigned to nearby bins or sets of bins. The complexity of this assignment can be reduced by first partitioning clusters into smaller windows, for example, of 500 clusters each. If clusters can be assumed to have uniform size, then fast linear assignment can be used; otherwise, approximation heuristics are needed.

Under AMG-style weighted disaggregation, interpolation proceeds by weighted averaging: each finer-level cluster is initially placed at the weighted average of the positions of all coarser-level clusters with which its connection is sufficiently strong [Chan et al. 2003b]. Finer-level connections can also be used: once a finer-level cluster is placed, it can be treated as a fixed, coarser-level cluster for the purpose of placing subsequent finer-level clusters. Weighted disaggregation is described further in Section 3.2 below.

A constructive approach, as in Ultrafast VPR [Sankar and Rose 1999], can also lead to extremely fast and scalable algorithms. At each level, clusters are initially placed in the following sequence: (i) clusters directly connected to output pads, (ii) clusters directly connected to input pads, (iii) other clusters.

Example: mPL. To provide a concrete example of an implementation of placement by multilevel optimization, we briefly describe the mPL package [Chan et al. 2000, 2003a, 2003b]. mPL began as an attempt at scalable placement by efficient nonlinear programming. Early experiments, however, showed that requiring descent at each iteration forced step sizes to be prohibitively small on problems with more than a few hundred variables. The complexity of the O(N²) nonconvex nonoverlap constraints rendered pointwise approximations useful only in microscopic neighborhoods of their evaluation points. Multiscale optimization was used to overcome this complexity barrier. An early fast and scalable formulation was produced relatively easily, at some cost in wirelength compared to leading methods. Subsequent work has made mPL competitive in both run time and quality of result, without loss of scalability.

Coarsening. mPL builds its hierarchy of problem scales by recursive First-Choice clustering [Karypis 1999]. The mPL-FC affinity that vertex i has for vertex j is

$$r_{ij} = \sum_{\{e \in E \,\mid\, i,j \in e\}} \frac{w(e)}{(|e| - 1)\,\mathrm{area}(e)}, \tag{1}$$

where w(e) is the weight assigned to hyperedge e, area(e) denotes the sum of the areas of the vertices in e, and |e| denotes the number of vertices in hyperedge e, viz., the degree of e. Dividing by |e| helps eliminate small hyperedges at coarser levels, making coarse-level netlists sparser and hence easier to place [Karypis 1999; Sankar and Rose 1999; Hu and Marek-Sadowska 2003].

Dividing by area(e) gives greater affinity to smaller cells and thus helps keep cluster areas commensurate. A relatively uniform cluster-area distribution improves the performance of the nonlinear-programming and slot-assignment modules discussed below. Given a vertex i at the finer level, the vertex j assigned to it has least hyperedge degree among those vertices within 10% of i's maximum FC affinity. When this choice is not unique, a least-area vertex is selected from the least-degree candidates. Hyperedges are defined at the coarser level simply by replacing the elements of the finer-level hyperedge $e = \{e_1, \ldots, e_k\}$ by their corresponding clusters, $\bar{e} = \{c(e_1), \ldots, c(e_k)\}$, where duplicate clusters are of course removed. Hence, hyperedge degrees decrease during coarsening, and many hyperedges eventually become singletons at some level, where they are ignored.

Relaxation. A customized nonlinear-programming solver is used at mPL's coarsest level (500 cells or fewer, by default) to obtain an initial solution. Relaxation at all other levels is restricted to sweeps of local refinements on subsets. All relaxations are combined or alternated with techniques for spreading cells out to obtain a sufficiently uniform cell-area distribution at each level.

The nonlinear-programming (NLP) formulation employs simplified, smoothed objective and constraint functions. At the coarsest level, clusters $v_i$ and $v_j$ are modeled as disks. Let x and y denote vectors containing the respective x- and y-coordinates of all cells and pads (pin locations are assumed to be at cell centers). Pairwise nonoverlap constraints $c_{ij}(x, y)$ are expressed directly in terms of the disk radii $\rho_i$ and $\rho_j$:

$$c_{ij}(x, y) = (x_i - x_j)^2 + (y_i - y_j)^2 - (\rho_i + \rho_j)^2 \ge 0 \quad \text{for all } i < j.$$

Quadratic wirelength over a clique-model graph approximation of the netlist,

$$q(x, y) = \sum_{i,j} \gamma_{ij}\bigl((x_i - x_j)^2 + (y_i - y_j)^2\bigr), \tag{2}$$

is minimized subject to the pairwise nonoverlap constraints (for efficiency, large nets are modeled as chains rather than cliques). The NLP solver is a customized interior-point method. In order that overlap can be removed gradually, a slack variable ξ is added to both the objective and the constraints:

$$\min_{x, y, \xi}\; f(x, y) + \alpha\xi \quad \text{subject to} \quad c_{ij}(x, y) + \xi \ge 0.$$

The penalty weight factor α is gradually increased to remove overlap (a toy sketch of this scheme appears below). After nonlinear programming and the QRS local-relaxation sweeps described below, bin-area densities are balanced by displacement-minimizing linear assignment of clusters to bin locations. Discrete Goto-based swaps are then employed, as described below, to further reduce wirelength prior to interpolation to the next level.
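To make the slack-variable scheme concrete, here is a toy sketch (our illustration, not mPL's customized interior-point solver) using SciPy's general-purpose constrained minimizer on three disks and a single clique net; all data and names are invented.

```python
import numpy as np
from scipy.optimize import minimize

# Toy sketch of the slack-variable NLP: minimize wirelength + alpha*xi
# subject to c_ij(x, y) + xi >= 0, raising alpha across rounds.

rho = np.array([1.0, 1.0, 1.0])                  # disk radii

def unpack(z):                                   # z = [x0..x2, y0..y2, xi]
    return z[0:3], z[3:6], z[6]

def wirelength(z):                               # clique-model quadratic WL
    x, y, _ = unpack(z)
    return sum((x[i]-x[j])**2 + (y[i]-y[j])**2
               for i in range(3) for j in range(i))

def constraints(z):                              # c_ij(x, y) + xi for all i < j
    x, y, xi = unpack(z)
    return np.array([(x[i]-x[j])**2 + (y[i]-y[j])**2 - (rho[i]+rho[j])**2 + xi
                     for i in range(3) for j in range(i+1, 3)])

z = np.array([0.0, 0.1, 0.2, 0.0, 0.1, -0.1, 4.0])   # clumped start, slack xi
for alpha in [1.0, 4.0, 16.0]:                        # gradually raise penalty
    res = minimize(lambda z, a=alpha: wirelength(z) + a*z[6], z,
                   constraints=[{"type": "ineq", "fun": constraints}],
                   bounds=[(None, None)]*6 + [(0.0, None)])
    z = res.x
print(z)   # disks spread out as the slack is squeezed toward zero
```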

For scalability, global relaxations (in which all variables are simultaneously modified) are avoided at all levels except the coarsest. Instead, two separate sweeps of iterative improvement over small subsets of vertices are performed. A uniform grid is used to monitor the area-density distribution.

The first of these local-relaxation sweeps is called quadratic relaxation on subsets (QRS). Traversing the netlist in simple depth-first-search (DFS) order, it selects movable vertices in small batches. For each batch, the quadratic wirelength of all nets containing at least one movable vertex, viewed as a continuous function of the movable vertex locations, is minimized without regard to overlap. Each such relocation typically introduces additional area congestion and is therefore followed directly by a clean-up step to keep the area-density distribution consistent. For this purpose, the "ripple-move" algorithm [Hur and Lillis 2000] is applied to any overfull bins after QRS on each batch. Ripple-move computes a max-gain monotone path of vertex swaps along a chain of bins leading from an overfull bin to an underfull bin. To facilitate area-congestion control, only very small batches of movable cells are used in QRS; the batch size is set to three in the reported experiments.

After the entire sweep of QRS plus ripple-move, a sweep of discrete, Goto-style permutations [Goto 1981] further reduces total wirelength. Vertices are visited one at a time in netlist order. The optimal "Goto" location of a given vertex a is computed by minimizing the sum of the bounding-box lengths of all nets containing a while holding all neighbors of a fixed. If a's Goto location is occupied by some vertex b, then b's optimal Goto location is similarly computed, along with the optimal Goto locations of all of b's nearest neighbors. The computations are repeated at each of these target locations and their nearest neighbors up to a predetermined limit (3–5). Chains of swaps are examined by moving a to some location in the Manhattan unit disk centered at b, moving the vertex at that location to some location in the Manhattan unit disk centered at its Goto location, and so on. The last vertex in the chain is then forced into a's original location. If the best such chain of swaps reduces wirelength, it is accepted; otherwise, the search begins anew at another vertex. To prevent corruption of a given cell-area distribution, swapping a large cell with a much smaller cell is explicitly disallowed.

Smoothing the distribution of cell area in the presence of widely varying cell or cluster sizes has particular importance. Interestingly, mPL ignores area variations among clusters at all coarser levels. That is, at every level except the finest, each cluster's area is set to the average of all cluster areas at that level. The reasons for this assumption's effectiveness are not completely understood. It is not used at the finest level, however, where larger-than-average cells are chopped into average-size fragments. After linear assignment of the cells and cell fragments to finest-level bins, fragments of the same cell are explicitly reunited. Any resulting area overflow is then removed by ripple-move cell propagation as described above.

Interpolation. mPL employs AMG-based weighted disaggregation in interpolating solutions from level to level (a small sketch appears after the iteration-flow discussion below). For each cluster at the coarser level, a single "C-point" representative component is selected from it for use as a fixed anchor. C-points are selected for maximal netlist degree and large area (area is used only if the maximum-degree vertex is not unique). C-points are locked in place at their parent clusters' centers.

Fig. 5. Some iteration flows for multilevel optimization.

The remaining "F-point" vertices in the cluster are ordered by nonincreasing weighted hyperedge degree. Following this order, they are then placed one by one at the weighted averages of their strong C-point neighbors and strong F-point neighbors already placed. This F-point repositioning is iterated a few times, but the C-points are held fixed throughout.

Iteration Flow. The idea behind the F-cycle shown in Figure 5 is that the accuracy of a coarse-level solution can be enhanced by recursively applying the multilevel flow to it before interpolation. Although the F-cycle flow does not compromise scalability, assuming linear-order relaxations, it may increase run time considerably. In mPL, two backtracking V-cycles are used as a compromise: instead of descending all the way back to the coarsest level after each interpolation, mPL backtracks just one level before continuing toward finer levels. The first backtracking V-cycle follows the connectivity-based FC clustering hierarchy described in the coarsening section above. The second backtracking V-cycle follows a different FC-cluster hierarchy, in which both connectivity and proximity are used to calculate vertex affinities:

$$r_{ij} = \sum_{\{e \in E \,\mid\, i,j \in e\}} \frac{w(e)}{(|e| - 1)\,\mathrm{area}(e)\,\|(x_i, y_i) - (x_j, y_j)\|}.$$
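The F-point repositioning described above admits a small sketch. It is our illustration with invented names: `strong` maps each F-point to a dict of strong neighbors and weights, and `pos` holds (x, y) positions.

```python
# AMG-style interpolation sketch: C-points stay fixed at their cluster
# centers; each F-point is repositioned a few times at the weighted average
# of its strong, already-placed neighbors.

def interpolate_f_points(pos, c_points, f_points_ordered, strong, passes=3):
    placed = set(c_points)                       # C-points act as fixed anchors
    for _ in range(passes):
        for v in f_points_ordered:               # nonincreasing weighted degree
            nbrs = {u: w for u, w in strong[v].items() if u in placed}
            if nbrs:
                total = sum(nbrs.values())
                pos[v] = (sum(w * pos[u][0] for u, w in nbrs.items()) / total,
                          sum(w * pos[u][1] for u, w in nbrs.items()) / total)
            placed.add(v)                        # later F-points may use v
    return pos
```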

During this second pass of clustering, vertex positions calculated in the first backtracking V-cycle are preserved by placing clusters at the weighted averages of their component vertices' positions. In the interpolation phase, however, the new clustering supports exploration of new territory in the solution space, enabling the flow to improve the result.

3.3 Embedded Multilevel Optimization

Leading algorithms owe their performance not only to their basic outer structure but also to sophisticated, hierarchical, iterative internal calculation. In particular, all leading contemporary partitioning-based methods described in Section 3.1.1, including Capo [Caldwell et al. 2000b], Dragon [Sarrafzadeh et al. 2002], and Feng Shui [Yildiz and Madden 2001a; Agnihotri et al. 2003; Khatkhate et al. 2004], rely heavily on multilevel algorithms for netlist partitioning. Although each partitioning is, as a component of the placement algorithm, performed on individual cells rather than on aggregates of cells, the partitioning algorithm itself is multilevel.

That is, a hierarchy of aggregates of cells is formed by recursive clustering; each of the resulting levels is partitioned by FM, starting with the coarsest level and proceeding in sequence to the finest, the solution at each level defining a starting point for iterative improvement at the next. Although no explicit use is made of the multilevel cluster hierarchy once partitioning is completed, it seems clear that the multilevel partitioners play an enabling role in these methods.

In fact, a placement problem of order 10⁶ cells and nets can still be solved "flat," that is, without any explicit aggregation or partitioning, provided that sufficiently fast and scalable numerical solvers are available for the given formulation. Two clear demonstrations of this approach are found in recent adaptations of force-directed methods [Quinn and Breuer 1979].

An AMG-Accelerated Force-Directed Method. Seminal work by Eisenmann and Johannes [1998] formulated placement as a sequence of unconstrained quadratic minimizations. The objective function

$$q(x, y) = \frac{1}{2}\bigl(x^T Q x + y^T Q y\bigr) + b_x^T x + b_y^T y + f_x^T x + f_y^T y$$

captures both netlist connectivity and area congestion by a graph approximation and a force-field calculation, as follows. Cell-to-cell connections determine the off-diagonal entries and part of the diagonal entries of the fixed graph Laplacian matrix Q by means of a quadratic star-wirelength model [Kleinhans et al. 1991]. Cell-to-pad connections contribute to the diagonal elements of Q, rendering it positive definite, and determine the linear-term coefficients in the right-hand-side vector b = (b_x, b_y). Viewing this vector b as external spring-like forces following Hooke's law, the circuit connectivity is represented by the (constant) symmetric-positive-definite matrix Q and the vector b. The perturbation vector f = (f_x, f_y) represents global area-distribution forces analogous to electrostatic repulsion, with cell area playing the role of electric charge. At each iteration, vector f is recalculated from the current cell positions by means of a fast Poisson-equation solver. Since Q does not change from one iteration to the next unless nets are reweighted, a hierarchical set of approximations to Q can typically be reused over several iterations. We refer to this approach as Poisson-based.

Recently, a customized AMG-based linear-system solver was derived for iterated force-directed quadratic-wirelength minimization [Chen et al. 2003]. The motivation is simply to improve the scalability of the Poisson-based approach, the run time of which is dominated by linear-system solves. The AMG approach to linear systems proceeds by repeatedly applying a simple relaxation update to the approximate solution at each level of an aggregation hierarchy. The relaxation is typically componentwise (Jacobi, Gauss-Seidel, SOR, etc.): the kth equation of the system $\sum_j a_{ij} x_j = b_i$, $i = 1, \ldots, n$, is used to express the kth coordinate $x_k$ of the solution in terms of the others, $x_i$, $i \neq k$, which are temporarily held fixed.² Coordinates are updated in turn, one at a time, in a sweep across the equations (see, for example, Golub and Van Loan [1989]).

²Since the system matrix for the force-directed approach is positive definite, its diagonal elements are strictly positive, and the expression $x_k = \bigl(b_k - \sum_{j \neq k} a_{kj} x_j\bigr)/a_{kk}$ is well defined.
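The componentwise update in the footnote is short enough to show directly; the following is a minimal Gauss-Seidel sweep on an invented toy system (an SOR variant would blend this update with the old value via a relaxation factor).

```python
import numpy as np

# Gauss-Seidel sweeps: x_k <- (b_k - sum_{j != k} a_kj * x_j) / a_kk,
# updating coordinates in place, one at a time, in a sweep over the equations.

def gauss_seidel_sweeps(A, b, x, sweeps=10):
    n = len(b)
    for _ in range(sweeps):
        for k in range(n):
            x[k] = (b[k] - A[k, :k] @ x[:k] - A[k, k+1:] @ x[k+1:]) / A[k, k]
    return x

# Example: a tiny symmetric positive-definite system.
A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 0.0])
print(gauss_seidel_sweeps(A, b, np.zeros(3)))
```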

With a well constructed hierarchy, solution of the optimality conditions

$$Q x = -(b_x + f_x) \quad \text{and} \quad Q y = -(b_y + f_y) \tag{3}$$

can be done in this way in linear or nearly linear amortized time. In the work of Chen et al. [2003], the hierarchy is derived by strict aggregation from four separate preliminary layouts. Several iterations of the SOR variant of coordinatewise relaxation are applied to the flat, unconstrained (i.e., ignoring overlap) quadratic over four separate trials. In each trial, the cells are initially placed all at the same point: one of the four corners of the placement region. Clusters are selected according to cells' average final proximity over the results of all four trials. Although this iterative, empirical approach to clustering requires significant run time, it is a fixed initial cost that can be amortized over the cost of subsequent iterations. Numerical results confirm the scalability of this approach.

3.4 Advances in Analytical Placement

As the size and complexity of placement instances continue to increase, continuous approximation becomes ever more effective. The impact of algorithm enhancements also becomes more noticeable at larger scales. Several recent papers have introduced novel and intriguing variations of basic analytical frameworks that consistently improve performance. Three of them are summarized here.

3.4.1 FastPlace. Recent work of Chu and Viswanathan [2004] demonstrates that placements comparable in total wirelength to those of leading available academic tools can be computed in orders of magnitude less run time. Experiments comparing FastPlace to CAPO [Caldwell et al. 2000b] and Dragon [Sarrafzadeh et al. 2002] demonstrate speed-up factors of 20× to 100× and relative differences in total wirelength within 1–10%. The FastPlace flow combines simple local and global heuristics in a way that supports fast computation and convergence. It repeats the following four steps (sketched in outline below) until the movable cells are distributed evenly enough that legalization and detailed placement can be applied.

(1) Minimize unconstrained modified quadratic wirelength.
(2) Shift cells locally to relieve high-density regions.
(3) Move cells one by one to reduce linear half-perimeter wirelength (HPWL).
(4) Calculate displacement forces corresponding to the displacements in steps (2) and (3) and impose them cell by cell, by means of pseudo-nets connected to pseudo-pads, modifying the quadratic objective accordingly.
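The outline below is our schematic of this loop, not FastPlace's code; every helper callable is an invented stand-in for the corresponding published heuristic.

```python
# Schematic of the four-step FastPlace-style flow described above.

def fastplace_style_flow(netlist, solve_quadratic, shift_cells, refine_hpwl,
                         add_pseudo_net_forces, spread_enough):
    placement = solve_quadratic(netlist)             # step (1)
    while not spread_enough(placement):
        moved = shift_cells(placement)               # step (2): local shifting
        moved += refine_hpwl(placement)              # step (3): greedy HPWL
        add_pseudo_net_forces(netlist, moved)        # step (4): lock in moves
        placement = solve_quadratic(netlist)         # re-solve with new forces
    return placement
```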

As the initial unconstrained quadratic tends to clump cells in the center of the region, cells flow generally from the center toward the boundary over the 20–30 iterations needed to converge. Local cell shifting proceeds as follows. A uniform bin grid is laid over the region, sized so that the average number of cells in a bin is about 4.

Bins are expanded or contracted independently in the horizontal and vertical directions in proportion to their cell-area content (the resulting bins are no longer disjoint). Cells are moved with their bins during the bin resizing, in proportion to the stretch or shrink. Local HPWL refinement proceeds greedily, cell by cell.

In contrast to the force-directed methods described in Section 3.3, FastPlace computes the forces needed to generate cell displacements already computed by other, local means. It imposes the forces only after making the displacements, incorporating them into the global quadratic objective as 2-pin pseudo-nets connecting cells directly to pseudo-pads placed along the chip boundary in the direction of the desired forces. These artificial forces prevent cells from collapsing back to their previous positions at the next iteration. The total number of iterations and the displacement per iteration are thus tightly controlled.

Interestingly, no hierarchy is used in FastPlace as originally published; all linear systems are solved by preconditioned conjugate gradients (PCG) with generic ILU preconditioning [Saad 1996]. Technically this approach, as published, is not scalable, due to the large number of PCG iterations needed to solve (3) at the early placement iterations. As iterations proceed, however, the artificial-pad forces accelerate the convergence of the PCG iterations. The force associated with each cell displacement is imposed by connecting the cell to an artificial pad on the placement-region boundary and weighting the connection to generate the displacement. Derivation of the linear-system coefficients shows that this technique adds positive numbers to the matrix diagonal elements. Because the quadratic wirelength model simulates a Hooke's-law spring system, forces at later iterations must be stronger than at earlier iterations. Hence, the diagonal elements of the linear system (3) increase, making the system easier to precondition accurately, making the ILU preconditioner more effective, and reducing the number of iterations necessary for PCG to converge. Scalability can be obtained simply by applying an AMG-based solver at earlier iterations, as described in the previous section.

3.4.2 Grid Warping. Following Proud [Tsay et al. 1988], most analytical placers, including Gordian [Kleinhans et al. 1991; Sigl et al. 1991] and FastPlace (Section 3.4.1 above), start from the premise that an unconstrained solution, that is, one obtained by minimizing wirelength without regard to nonoverlap or any other constraints, provides a high-quality relative ordering of the placeable objects. Although connections to fixed terminals generally make the optimal unconstrained cell locations distinct, the unconstrained solution is still typically extremely nonuniform and is not easily legalized in any way that approximately preserves the cells' relative orderings. What most distinguishes different analytical approaches is the manner in which they spread the cells from an initially highly nonuniform distribution to one uniform enough to be legalized without major perturbations.

The novelty of grid warping [Xiu et al. 2004] is that, rather than directly move cells based on their distribution, it uses the cells to deform or warp the region in which they lie, in an analogy with gravity as described by Einstein's general relativity. The inverse of the deformation is then used to carry cells from their original locations to a more uniform distribution.

Fig. 6. Prewarping. The gridlines of a uniform bin grid (a) are translated so as to capture roughly equal numbers of cells in each row and column (b). The inverse translation is applied to cells as well as gridlines, giving a more even distribution of the cells over the region (c).

Fig. 7. Warping. Each bin of a uniform bin grid (a) is mapped to a corresponding quadrilateral in an oblique but slicing bin structure (b) so as to capture roughly equal numbers of cells in each quadrilateral. The inverse bin maps are applied to the cells in order to spread them out (c).

Two variants of the basic concept are shown in Figures 6 and 7. In Figure 6, a nonuniform rectilinear grid is defined so that each of its rows and columns contains approximately the same amount of total cell area. This simplified form of grid modification, in which all gridlines remain parallel to the coordinate axes, is called "prewarping" by Xiu et al., because they find it useful as a fast first step in the spreading process.

Figure 7 illustrates the more general approach. As shown, oblique grid lines are used, and although a slicing structure with alternating cutline directions and quadrilateral bins is maintained, gridlines not necessary to the slicing pattern are broken at points where they intersect other gridlines. This weakening of the grid structure allows close neighbors in the original unconstrained placement to be separated by a relatively large distance by the warping.

The warp is defined as a collection of bilinear maps, one from each of the bins in the original, uniform grid to the corresponding quadrilateral in the warped grid. To invert the warp and move the cells, the inverse of each such bilinear map is applied to the coordinates of all the cells in its quadrilateral bin. A block-scanline approximation algorithm is used for fast determination of which cells are contained in which quadrilateral.

The grid points of the warped grid are determined simultaneously by a derivative-free nonlinear-optimization method of Brent and Powell.
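For concreteness, here is a minimal sketch of one bin's bilinear map (our illustration; corner names are invented). The spreading step applies the inverse of such a map to each cell; inverting it amounts to solving a small quadratic per cell, which we omit here.

```python
# Forward bilinear map from the unit square (s, t) in [0,1]^2 to a
# quadrilateral with corners p00, p10, p01, p11.

def bilinear(p00, p10, p01, p11, s, t):
    """Map unit-square coordinates (s, t) into the quadrilateral."""
    x = (1-s)*(1-t)*p00[0] + s*(1-t)*p10[0] + (1-s)*t*p01[0] + s*t*p11[0]
    y = (1-s)*(1-t)*p00[1] + s*(1-t)*p10[1] + (1-s)*t*p01[1] + s*t*p11[1]
    return x, y

p00, p10, p01, p11 = (0.0, 0.0), (2.0, 0.2), (0.1, 1.5), (2.2, 1.8)
print(bilinear(p00, p10, p01, p11, 0.5, 0.5))   # image of the bin center
```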

The required top-down slicing grid structure is maintained by (a) fixing the alternating cutline-direction order a priori, by deciding whether to orient the first cut from top to bottom or from side to side, and (b) expressing each cutline after the first in terms of two variables, one for where it intersects its parent cutline and another for where it intersects the opposite boundary or cutline. A penalty function f is used as the objective:

$$f = \mathrm{wirelength} + \rho \cdot \sum_{\mathrm{bins}} \beta_{ij},$$

where $\beta_{ij}$ is approximately the square of the difference between the total cell area in bin (i, j) and the target cell area κ = κ(i, j) for each bin. Wirelength is the total weighted half-perimeter wirelength obtained after the inverse warp. Although evaluating the objective is fairly costly, the number of variables in the optimization is low (only 6 for a 2 × 2 grid or 30 for a 4 × 4 grid), and convergence is fast.

To obtain an effective and scalable placement algorithm, this basic spreading operation must still be incorporated within some hierarchical framework, for example, top-down partitioning or multilevel optimization. Xiu et al. [2004] use top-down partitioning, with the partitions defined by grid warping. That is, each step of grid warping defines a partition of both cells and space. Cutsize-driven partitioning is used to separate cells lying on grid boundaries. Prewarping and warping recur on the subregions defined by the grids. The overall flow is summarized below.

(1) Unconstrained quadratic-wirelength placement.
(2) Redistribute cells by prewarping (e.g., with an 8 × 8 grid).
(3) Redistribute cells by linear-wirelength-optimizing grid warping (e.g., with a 4 × 4 grid).
(4) Assign cells to grid bins, using cutsize-driven partitioning only on cells near bin boundaries.
(5) Propagating terminals to bin boundaries, recur prewarping, warping, and cell partitioning separately on the cells in each bin, until the overall cell distribution is even enough for legalization.

3.4.3 Multilevel Generalized Force-Directed Placement. While the force-directed framework described in Section 3.3 has broad appeal for its generality and scalability, considerable effort is needed to produce a fast and stable implementation [Vorwerk et al. 2004]. Recently, this framework has been generalized to a more rigorous mathematical formulation and adapted to a multilevel implementation in mPL5 [Chan et al. 2005]. An overview of the mPL5 formulation is given here.

Placement objectives and constraints at each level of a cluster hierarchy (Section 3.2) are approximated by smooth functions. A bounding-box weighted-wirelength objective is approximated by the log-sum-exp model

[Naylor et al. 2001; Kahng and Wang 2004]:

$$W(x, y) = \gamma \sum_{e \in E} \Bigl( \log \sum_{v_k \in e} \exp(x_k/\gamma) + \log \sum_{v_k \in e} \exp(-x_k/\gamma) + \log \sum_{v_k \in e} \exp(y_k/\gamma) + \log \sum_{v_k \in e} \exp(-y_k/\gamma) \Bigr), \tag{4}$$

where x and y denote vectors of cell x- and y-coordinates. The smaller the parameter γ, the more accurate the approximation. Letting $D_{ij}$ denote the cell-area density of bin $B_{ij}$ and K the total cell area divided by the total placement area, the area-density constraints are initially expressed simply as $D_{ij} = K$ over all bins $B_{ij}$. Viewing the $D_{ij}$ as a discretization of the smooth density function d(x, y), these constraints are smoothed by approximating d by the solution ψ to the Helmholtz equation

$$\Delta\psi(x, y) - \epsilon\,\psi(x, y) = d(x, y), \quad (x, y) \in R; \qquad \frac{\partial \psi}{\partial \nu} = 0, \quad (x, y) \in \partial R, \tag{5}$$

where ε > 0, ν is the outer unit normal, ∂R is the boundary of the placement region R, d(x, y) is the continuous density function at a point (x, y) ∈ R, and Δ is the Laplacian operator $\Delta \equiv \partial^2/\partial x^2 + \partial^2/\partial y^2$. The smoothing operator $\Delta_\epsilon^{-1} d(x, y)$ defined by solving (5) is well defined, because (5) has a unique solution for any ε > 0. Since the solution of (5) has two more derivatives [Evans 2002] than d(x, y), ψ is a smoothed version of d. Discretized versions of (5) can be solved rapidly by fast numerical multilevel methods. Recasting the density constraints as a discretization of ψ gives the nonlinear programming problem

$$\min W(x, y) \quad \text{s.t.} \quad \psi_{ij} = -K/\epsilon, \quad 1 \le i \le m,\ 1 \le j \le n, \tag{6}$$

where the $\psi_{ij}$ are obtained by solving (5) with the discretization defined by the given bin grid. Interpolation from the adjacent coarser level (Section 3.2) defines a starting point. This nonlinear-programming problem is solved by the Uzawa iterative algorithm [Arrow et al. 1958], which does not require second derivatives or large linear-system solves:

$$\nabla W(x^{k+1}, y^{k+1}) + \sum_{i,j} \lambda_{ij}^k \nabla \psi_{ij} = 0, \qquad \lambda_{ij}^{k+1} = \lambda_{ij}^k + \alpha\,(\psi_{ij} + K/\epsilon), \tag{7}$$

where λ is the Lagrange multiplier, λ⁰ = 0, α is a parameter to control the rate of convergence, and the gradients of $\psi_{ij}$ are approximated by simple forward finite differences $\nabla_{x_k} \psi_{ij} = (\psi_{i,j+1} - \psi_{i,j})/h_x$ and $\nabla_{y_k} \psi_{ij} = (\psi_{i+1,j} - \psi_{i,j})/h_y$ when the center of cell $v_k$ is inside $B_{ij}$, and are set to zero otherwise. The nonlinear equation for $(x^{k+1}, y^{k+1})$ is recast as an ordinary differential equation and solved by an explicit Euler method [Morton and Mayers 1994].
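The log-sum-exp model (4) is easy to evaluate numerically; the following small sketch (our illustration, with invented toy pin coordinates) computes the smoothed half-perimeter wirelength of a single net, using the standard max-shift for numerical stability.

```python
import numpy as np

# Smoothed HPWL of one net via the log-sum-exp model: as gamma -> 0, the
# value approaches the half-perimeter of the net's bounding box.

def lse(v, gamma):
    m = v.max()                                   # shift by max for stability
    return m + gamma * np.log(np.sum(np.exp((v - m) / gamma)))

def smooth_hpwl(xs, ys, gamma=0.01):
    """Smoothed half-perimeter wirelength of a net with pins (xs, ys)."""
    return (lse(xs, gamma) + lse(-xs, gamma) +
            lse(ys, gamma) + lse(-ys, gamma))

xs = np.array([0.0, 2.0, 5.0]); ys = np.array([1.0, 4.0, 2.0])
print(smooth_hpwl(xs, ys))   # close to (5 - 0) + (4 - 1) = 8 for small gamma
```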

Multiscaling the generalized force-directed algorithm renders it much more scalable, and enabling efficient, global analytical relaxations in the multilevel framework dramatically improves placement quality. Compared with other leading academic tools (Dragon 3.01, Capo 8.8, Feng Shui 2.6, and FastPlace 1.0) on the standard IBM-ISPD98 benchmark circuits, mPL5's average wirelength is the shortest: 1% shorter than Dragon's, 10% shorter than Capo's, 4% shorter than Feng Shui's, and 11% shorter than FastPlace's. The run time of mPL5 is also very competitive: 9× faster than Dragon's, about the same as Capo's, 2× faster than Feng Shui's, and 8× slower than FastPlace's. A fast mode of mPL5 gives wirelength roughly between that of Feng Shui and Capo, with run time only 2× longer than that of FastPlace.

3.5 Legalization and Detailed Placement

While most research in placement is still directed at global placement, some recent work on the PEKO benchmarks (Section 2) suggests that existing global-placement algorithms already place nearly every cell within a few cell-widths of its globally optimal location [Ono and Madden 2005]. This observation has renewed interest in improved methods for legalization [Li and Koh 2003; Agnihotri et al. 2003; Khatkhate et al. 2004] and detailed placement [Brenner et al. 2004; Ramachandaran et al. 2005]. Nevertheless, progress in global placement still continues to be made. The search for the most effective means of legalizing and refining a good global placement plays a critical role in current efforts to further reduce the gap between achievable and optimal placements.

4. TIMING OPTIMIZATION

Extensive research on timing-driven placement has been done in the past two decades and continues today. The performance of a circuit is determined by its longest path delay, but timing constraints are extremely complex. The number of paths present grows exponentially with circuit size, so even a circuit of modest size can have a huge number of paths. For example, Chang et al. [1994] estimated the number of path constraints in a 5K-cell design to be around 245K, requiring roughly 243 Mb of memory if stored explicitly. Moreover, users may have different requirements for different paths. For example, a circuit may have different t_su (input to register), t_co (register to clock output), r2r (register to register), or i2o (input to output) requirements for individual nodes or paths. The existence of multiple clock domains and multiple-cycle paths makes the problem even more complicated. Existing timing-driven placement algorithms can be broadly classified into two categories: path-based and net-based.

4.1 Path-based Algorithms

Path-based algorithms try to directly minimize the longest path delay. Popular approaches in this category include mathematical programming and iterative critical path estimation. Formulation of the problem as a linear or nonlinear programming problem typically introduces auxiliary variables (i.e., arrival times) at circuit nodes [Jackson and Kuh 1989; Srinivasan et al. 1991; Hamada et al. 1993]. Different mathematical programming techniques can then be used to solve the problem. For example, in terms of arrival time a(i) at pin i, timing constraints
may be expressed as follows:

    a(j) ≥ a(i) + d(i, j)   ∀(i, j) ∈ G
    a(j) ≤ T                ∀j ∈ PO
    a(i) = 0                ∀i ∈ PI,

where

    G        denotes the timing graph;
    PI       denotes the set of starting points of any timing path, including
             primary input pins and output pins of memory elements;
    PO       denotes the set of ending points of any timing path, including
             primary output pins and data input pins of memory elements;
    d(i, j)  denotes the delay of timing arc (i, j), which is either a constant
             (for cell internal delay) or a function of cell locations;
    T        denotes the longest path-delay target.

Here we assume that the arrival time at all PI pins is zero, and that all PO pins have the same delay target. Simple changes can be made to the formulation to accommodate more complex situations. Explicitly minimizing the sum of the lengths of the paths in some set of critical paths is a popular approach. This set of critical paths can be precomputed in a static manner or dynamically adjusted from iteration to iteration. TimberWolf [Swartz and Sechen 1995] used simulated annealing to minimize a set of pre-specified timing-critical paths, while mathematical programming techniques [Burstein and Youssef 1985; Marek-Sadowska and Lin 1989] have also been employed. The advantage of path-based algorithms is their accurate timing view during the optimization procedure. However, the drawback is that they usually require substantial computation resources due to the exponential number of paths which need to be simultaneously minimized. Moreover, in certain placement frameworks, for example, top-down partitioning, it is very difficult or infeasible to maintain an accurate global timing view.

4.2 Net-based Algorithms

Net-based algorithms [Dunlop et al. 1984; Nair et al. 1989; Tsay and Koehl 1991; Eisenmann and Johannes 1998], on the contrary, do not directly enforce path-based constraints. Instead, timing constraints or requirements on paths are transformed into either net-length constraints or net weights. This information is then fed to a weighted-wirelength-minimization-based placement engine to obtain a new placement with better timing. This new placement is then analyzed by a static timing analyzer, thus generating a new set of timing information to guide the next placement iteration. Usually this process must be repeated for a few iterations until no improvement can be made or until a certain iteration limit has been reached. The process of generating net-length constraints or net-delay constraints is called delay budgeting [Hauge et al. 1987; Gao et al. 1991; Luk 1991; Youssef et al. 1992; Tellez et al. 1996; Sarrafzadeh et al. 1997a, 1997b; Chen et al.
2000]. The main idea is to distribute the slacks at the endpoints of each path (POs or inputs of memory elements) to the constituent nets of the path such that a zero-slack solution is obtained [Nair et al. 1989; Youssef and Shragowitz 1990; Chen et al. 2000]. The original zero-slack algorithm (ZSA) [Nair et al. 1989] assigns slacks based mainly on fanout factors. Subsequently, researchers considered a more general framework [Sarrafzadeh et al. 1997b] in which delay budgeting is formulated as follows (the formulation has been modified to fit the context here):

    max   Σ_{(i,j)∈G} C_ij(s_ij)
    such that
          s_ij = a(j) − a(i) − d(i, j)   ∀(i, j) ∈ G
          a(j) ≤ T                       ∀j ∈ PO
          a(i) = 0                       ∀i ∈ PI

Here s_ij is the slack of edge (i, j), and C_ij(s_ij) is a flexibility function of edge (i, j) (this concept is not explicitly used in the original paper). The intuition is that, since delay budgeting will generate a set of constraints for the placement, these constraints should be stated as weakly as possible, in order to minimize their impact on solution quality. A serious drawback of this class of algorithms is that delay budgeting is usually done in the circuit's structural domain, without consideration of physical placement feasibility. There is no generally agreed-upon good flexibility function. As a result, delay budgeting may severely overconstrain the placement problem. Recently, some attempts have been made to unify delay budgeting and placement [Sarrafzadeh et al. 1997a; Yang et al. 2002a; Halpin et al. 2001], where a complete or coarse [Yang et al. 2002a; Halpin et al. 2001] placement solution is used to guide the delay-budgeting step. However, it is generally difficult to find an efficient or scalable algorithm for such unification.

To overcome these problems, approaches based on net weighting use different means. Instead of assigning a delay budget to each individual net or edge, net-weighting-based approaches assign weights to nets based on their timing criticality. Compared with delay-budgeting approaches, these methods do not suffer from the overconstraining problem. Net-weighting-based algorithms are generally very flexible. They can be integrated naturally into an existing wirelength-minimization-based placement framework. They also have a relatively low complexity. As circuit sizes continue to increase and practical timing constraints become increasingly complex, these advantages make net-weighting-based approaches more and more attractive.

Consider the following simple example. Suppose a circuit contains only one timing path P, which consists of the two-pin connections e_1, e_2, ..., e_n. Let d(e_i) denote the delay of edge e_i. Using a net-weighting approach, we assign every edge the same weight, say 1 (as all edges are in the same path, they have the same criticality).



We can transform the path minimization problem exactly into the following problem:

    min   Σ_{i=1}^{n} d(e_i).

However, under a delay-budgeting approach, we first need to find a delay budget b(e_i) for each edge, and then solve the following constrained optimization problem:

    min   f(x)
    such that   d(e_i) ≤ b(e_i),   ∀ 1 ≤ i ≤ n.

The objective function f(x) is usually wirelength; timing has been relegated to the constraints. It is obvious that this method suffers from the overconstraining problem: if we use a nonoptimal delay budget, there is no guarantee that we can find a solution as good as that obtained under a net-weighting approach. An optimal delay budget is very difficult to compute without solving the placement problem. Even if an optimal delay budget can be obtained, there may exist many optimal delay budgets whose resulting timing-constraint sets differ wildly. Some may be very tight, others may be very loose, and predicting which are tight or loose may be very difficult.

Unfortunately, despite its advantages, net weighting is usually done in an ad-hoc, intuitive manner. The main principle used in most algorithms is that a timing-critical net should receive a heavy weight. For example, VPR [Marquardt et al. 2000] used the following formula to assign a weight to an edge e:

    w(e) = (1 − slack(e)/T)^α,

where T is the current longest path delay and α is a constant. These methods ignore another important principle: path sharing. In general, an edge with many paths passing through it should receive a heavy weight as well. Path counting is a method developed to take path-sharing effects into consideration by computing the number of paths passing through each edge in the circuit. These numbers can then be used as edge weights. Unfortunately, the naive approach suffers from a severe drawback: it cannot distinguish timing-critical paths from non-critical paths. The variant ε-network path counting [Senn et al. 2002] suffers from the same problem [Kong 2002].

A nice solution has recently been proposed [Kong 2002]. The algorithm, named PATH, properly scales the impact of all paths by their relative timing criticalities (measured by their slacks), instead of counting critical and non-critical paths with equal weight. It is shown [Kong 2002] that for certain discount functions, this method is equivalent to enumerating all the paths in the circuit, counting their weights, and then distributing the weights to all edges in the circuit. Such computation can be carried out very efficiently in linear time, and experimental results have confirmed its effectiveness. Compared with VPR [Marquardt et al. 2000] under the same placement framework, PATH reduces the longest path delay by 15.6% on average with no runtime overhead and only a 4.1% increase in total wirelength.

Like a standard static timing-analysis algorithm, the PATH algorithm works in two phases: in the first, forward phase, a forward partial counting is conducted, while in the second, backward phase, a backward partial counting is performed.

Table I. The PATH Algorithm Accurately Computes the Impact of All Paths Passing Through Each Edge

    set F(p) = B(p) = 0 for each pin p;
    for each PI pin p, set F(p) = 1;
    traverse pins t in topological order
        for each input pin s of t
            Fs(s, t) = a(t) − a(s) − d(s, t);
            compute discount = D(Fs(s, t), T);
            F(t) += discount × F(s);
    for each PO pin p, set B(p) = 1;
    traverse pins s in reverse topological order
        for each output pin t of s
            Bs(s, t) = r(t) − d(s, t) − r(s);
            compute discount = D(Bs(s, t), T);
            B(s) += discount × B(t);
    for each edge (s, t),
        compute AP(s, t) = F(s) × B(t) × D(slack(s, t), T);

Here r(t) represents the required arrival time at pin t, and D(x) represents the weighting function (called the discount function).
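The following Python sketch implements the two-phase counting of Table I under stated assumptions: arrival times a, required times r, arc delays, a topological order, and the discount function D are supplied by the caller, and edge slack is taken as r(t) − a(s) − d(s, t); this is an illustration of the scheme, not Kong's actual code.

    from collections import defaultdict

    def path_weights(arcs, delay, a, r, T, D, pis, pos, topo):
        # arcs: list of (s, t) pin pairs; delay[(s, t)]: arc delay;
        # a, r: arrival / required times per pin; T: longest path delay;
        # D(x, T): discount function; pis / pos: primary inputs / outputs;
        # topo: all pins in topological order.
        fanin, fanout = defaultdict(list), defaultdict(list)
        for s, t in arcs:
            fanin[t].append(s)
            fanout[s].append(t)
        F = defaultdict(float)
        B = defaultdict(float)
        for p in pis:
            F[p] = 1.0
        for t in topo:                              # forward partial counting
            for s in fanin[t]:
                fs = a[t] - a[s] - delay[(s, t)]
                F[t] += D(fs, T) * F[s]
        for p in pos:
            B[p] = 1.0
        for s in reversed(topo):                    # backward partial counting
            for t in fanout[s]:
                bs = r[t] - delay[(s, t)] - r[s]
                B[s] += D(bs, T) * B[t]
        # With the trivial discount D(x, T) = 1 this returns the raw number
        # of paths passing through each edge.
        return {(s, t): F[s] * B[t] * D(r[t] - a[s] - delay[(s, t)], T)
                for s, t in arcs}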

PATH maintains two counters for each pin p: F(p), the forward partial counter, and B(p), the backward partial counter. For each edge (s, t), PATH also maintains a counter AP(s, t) for the total number of weighted paths passing through it. A brief description of the algorithm is listed in Table I. It is interesting to note that if we use the trivial discount function D(x) = 1, we get the total number of paths passing through each edge in the timing graph.

A potential problem in net-based approaches is the so-called oscillation problem. Usually net weights or budgets are assigned by performing timing analysis on some given placement solution P_n at the nth iteration. More critical nets receive higher weights. Thus, in the next placement solution P_{n+1}, the lengths of critical nets in P_n will be reduced, while the lengths of other, noncritical nets are potentially increased, resulting in changes in net criticalities and, thus, in net weights. If a net alternates between critical and noncritical, its length may alternately increase and decrease, impeding convergence. Certain path-based approaches suffer from similar problems, for example, a need to dynamically adjust the set of paths being optimized [Swartz and Sechen 1995].

Two ways to eliminate the oscillation problem appear widely in the literature. The first approach is to perform timing analysis and recompute net weights periodically. VPR [Marquardt et al. 2000] and PATH [Kong 2002] follow this approach. Based on simulated annealing, both methods perform timing analysis and net re-weighting once per temperature. The second approach is to make use of historic information [Eisenmann and Johannes 1998], that is, to combine weights from previous iterations with criticality information in the current placement to derive the current weights. Intuitively, if a net is always critical during all placement iterations, we want to gradually increase its weight; if it is never critical, we gradually decrease its weight. As a typical example,
we consider the approach used by Eisenmann and Johannes [1998]. At each iteration m, each net e has a criticality c_m(e), initialized to zero and updated as follows:

    c_m(e) = (c_{m−1}(e) + 1)/2   if e is among the 3% most critical nets
    c_m(e) = c_{m−1}(e)/2         otherwise.

The weight w_m(e) is initialized to one and updated as w_m(e) = w_{m−1}(e)(1 + c_m(e)).

The general approaches described above can be viewed as iteratively targeting the worst negative slack (WNS). Rather than focus only on the most critical path, it is possible to consider the sum of all slacks over all paths. In this approach, different paths have different delay targets and thus different required arrival times. The timing constraints for a given path are satisfied at a point i if the slack at i is sufficiently positive. More recently, a sensitivity-guided net-weighting strategy based on minimizing total slack over all paths has been proposed [Ren et al. 2004]. In this approach, nets are targeted for their impact on the global objective and not necessarily for their criticality.
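A minimal sketch of this history-based re-weighting follows; the choice of the top 3% by slack as the criticality test and the dictionary-based bookkeeping are illustrative assumptions.

    def update_net_weights(nets, slack, crit, weight, frac=0.03):
        # One re-weighting iteration in the style of Eisenmann and Johannes
        # [1998]. nets: net ids; slack: slack per net; crit / weight:
        # dictionaries carried across iterations (crit starts at 0.0,
        # weight at 1.0).
        ranked = sorted(nets, key=lambda e: slack[e])        # most critical first
        cutoff = set(ranked[:max(1, int(frac * len(ranked)))])
        for e in nets:
            if e in cutoff:
                crit[e] = (crit[e] + 1.0) / 2.0   # drifts toward 1 while critical
            else:
                crit[e] = crit[e] / 2.0           # decays when not critical
            weight[e] *= 1.0 + crit[e]            # weights only ever increase
        return crit, weight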

5. ROUTABILITY OPTIMIZATION

Routing congestion is one of the fundamental issues in VLSI physical design. Because an aggressive wirelength-driven placement may not be routable, routability is best considered directly during the placement phase. Routability-driven placement involves mainly (i) routability modeling and (ii) optimization techniques for routability control. Usually, optimization for routability control is performed based on the estimated routing congestion of a placement configuration. We discuss these two issues in the following subsections.

5.1 Routability Modeling

Routability is usually modeled on an X × Y global-routing grid over the chip's core region. Routing supply and demand are modeled for each bin and/or each boundary of the routing grid structure. There are two major categories of routability modeling: topology-free (TP-free), where no explicit routing is done, and topology-based (TP-based), where routing trees are explicitly constructed on some routing grid.

5.1.1 Topology-free Modeling. TP-free modeling is faster in general. Examples of this class include bounding-box (BBOX)-based modeling [Cheng 1994], probabilistic analysis-based modeling [Lou et al. 2002; Westra et al. 2004], Rent's rule-based modeling [Yang et al. 2002b], and pin density-based modeling [Brenner and Rohe 2003].

In RISA modeling [Cheng 1994], the routing supply for each bin in the routing grid structure is modeled according to how the existing wiring of power or clock nets, regular cells, and macros (macros are referred to as “mega cells” [Cheng 1994]) are placed, and the routing demand of a net is modeled by its weighted BBOX length. Let T_v and T_h denote the total numbers of full tracks available in the vertical and horizontal directions over the core area, including all metal layers.

Fig. 8. RISA modeling for routing supply of each bin with existing wiring (a), regular cell (b), and macro (c).

Given a global-routing bin structure of X × Y bins, the initial routing supply of a bin (i, j) is

    S^h_ij = T_h / X,    S^v_ij = T_v / Y.

Existing wiring of power and clock nets, regular cells, and macros are considered obstacles to routing; thus, the routing resource supply decreases when any of these elements is found in a bin. Given a bin (i, j) of width W and length L, when existing wiring on layer m is found in it, as shown in Figure 8(a), the routing supply of the bin decreases by w × l/L, where w is the width of the wiring, expressed as a number of routing tracks. A regular cell normally creates a blockage as large as its outline at the first metal layer and a few smaller blockages in the second metal layer. Therefore, the routing supply decrease due to a regular cell of width w and length l in a bin, as shown in Figure 8(b), is

    T_1 × (l × w)/(L × W)           (metal layer 1—horizontal)
    c/s × T_2 × (l × w)/(L × W)     (metal layer 2—vertical),

where T_1 and T_2 are the numbers of tracks on the first and second metal layers of the bin, respectively, s is the cell width in units of the vertical tracks (assuming the routing tracks on the second metal layer are vertical), and c is the number of vertical tracks occupied by the blockages of the cell on the second metal layer as well as blockages caused by connections to pins in metal layer 1. In multilayer designs, macros usually fully block the first two metal layers over the macros' outlines. When a macro is placed over a bin, as shown in Figure 8(c), the routing supply decreases by (T_1, T_2) × (l × w)/(L × W) in the respective horizontal and vertical directions, where w and l are the width and length of the rectangular intersection of the macro and the bin, respectively.

Although an optimal Steiner tree is not used for estimating the routing demand of a net, the probability of having a wire at location (x, y) within a net's BBOX is approximated by adding up and normalizing K optimal Steiner trees of K sets of M randomly located pins each (e.g., K = 10,000). A wiring distribution map (WDM) is obtained for each pin-count case in each direction. It is incorporated into the wirelength objective by adjustment of net weights.


Fig. 9. RISA modeling for routing demand.

Fig. 10. Partition net BBOX (NBB) into sub-BBOX (SBB) to model the detour due to macro blockage.

The new net weights are calculated by finding the mean values of track usage over the WDMs for each of the different pin counts. The net weight q represents the expected number of wires crossing a cut line through the net bounding box when a net of a high pin count is routed, no matter whether the cut line is vertical or horizontal or where it is. Given a net BBOX overlapping with a bin (i, j), as shown in Figure 9, and the net weight q in units of routing tracks, the routing demand of the bin increases by

    q × ( (w × l)/(Y × W), (w × l)/(X × L) ).

When a net BBOX overlaps with macros, which usually block the first and second metal layers completely, the above model is revised so that possible detours due to macro blockage can be considered during the estimation. The revision is based on efficiently partitioning the net bounding box (NBB) into a set of sub-bounding boxes (SBBs). Given a net BBOX and the overlapping macros, SBBs are obtained by extending each boundary of each macro such that it cuts through the NBB, as shown in Figure 10(a). The SBB set also forms a coarse global-routing grid on which a multilayer routing is performed to find a Steiner tree that connects all the SBBs where pins are located. If such a Steiner tree does not exist, the NBB is expanded by 2× and the search repeated, until the NBB covers the full core area. For each edge of the Steiner tree, if it passes through any boundary of any SBB, a pseudo-pin is placed at the center of the boundary of that SBB. Each SBB that covers real pins and pseudo-pins is resized to contain both, as shown in Figure 10(b), and its routing demand is estimated using the weighted BBOX model.
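A small sketch of this style of supply/demand bookkeeping on a uniform bin grid, combining the initial supply above with the weighted-BBOX demand increment; the data layout and the q_of callback are illustrative assumptions, and macro and wiring blockages are omitted.

    import numpy as np

    def risa_overflow(X, Y, Th, Tv, W, L, nets, q_of):
        # X, Y: bins per row / column; Th, Tv: total horizontal / vertical
        # tracks over the core; W, L: bin width / length; nets: BBOXes
        # (x0, y0, x1, y1) in physical coordinates; q_of(net): weighted-BBOX
        # net weight q in routing tracks.
        demand_h = np.zeros((Y, X))
        demand_v = np.zeros((Y, X))
        for net in nets:
            x0, y0, x1, y1 = net
            q = q_of(net)
            j0, j1 = int(x0 // W), int(np.ceil(x1 / W))
            i0, i1 = int(y0 // L), int(np.ceil(y1 / L))
            for i in range(max(i0, 0), min(i1, Y)):
                for j in range(max(j0, 0), min(j1, X)):
                    w = min(x1, (j + 1) * W) - max(x0, j * W)   # overlap width
                    l = min(y1, (i + 1) * L) - max(y0, i * L)   # overlap length
                    if w <= 0 or l <= 0:
                        continue
                    demand_h[i, j] += q * (w * l) / (Y * W)
                    demand_v[i, j] += q * (w * l) / (X * L)
        # Initial per-bin supply S^h_ij = Th/X, S^v_ij = Tv/Y; positive
        # entries of the result indicate estimated overflow.
        return demand_h - Th / X, demand_v - Tv / Y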


The probabilistic analysis-based modeling generally assumes that (i) all nets are optimally routed with the shortest length, (ii) there is at most one change of direction per grid cell, and (iii) there is no change of direction in a grid cell containing a pin, unless more than one pin lies in the same grid cell. A net-based stochastic model for 2-pin nets has been presented to compute expected horizontal and vertical track usage with consideration of routing blockage [Lou et al. 2002]. It was further extended based on experimental evidence showing that only a negligible number of nets have detours and that the number of nets with many bends can be ignored [Westra et al. 2004]. Peak routing demand and regional routing demand can be estimated using Rent's rule [Yang et al. 2002b]. Hu and Marek-Sadowska [2002] propose a Rent's-rule-based implicit white-space allocation method to capture the congestion picture and to weight the BBOX length of the nets based on pin locations. Its implicit modeling helps to combine estimation and optimization in one step. Pin density per bin can be used as a metric for intrabin routing congestion, but it cannot model the interbin boundary congestion. Therefore, it is combined with probabilistic analysis-based modeling for completeness [Brenner and Rohe 2003].

5.1.2 Topology-based Modeling. In a TP-based modeling method, a Steiner tree topology is generated for each net on the given routing grid. Because these routing topologies usually have a fairly strong correlation with the topologies a global router generates, TP-based modeling can be quite accurate. At the least, it generates a global-routing solution, that is, it provides an upper bound for routability estimation. If a TP-based modeling method uses a topology similar to what the after-placement router does, the fidelity of the model can be guaranteed. However, topology generation is often of high complexity; therefore, most research focuses mainly on efficiency.

In one approach [Mayrhofer and Lauther 1990], a precomputed Steiner tree topology on a few grid structures is used for wiring-demand estimation. This approach is tailored for recursive partition-based placement. In another approach [Chang et al. 2003a], two algorithms of logarithmic complexity were proposed: a fast congestion-avoidance two-bend routing algorithm for two-pin nets, the LZ-router, and the IncA-tree algorithm, which supports incremental updates for building a rectilinear Steiner arborescence tree (A-tree) for a multi-pin net.

The LZ-router uses auxiliary data structures (similar to a segment tree) to find good-quality routes by performing a binary search of the possible routes for a two-pin net. The wire density of a bin or region is defined as the wire usage of the bin or region divided by its area. Consider a net connecting two pins, P_1 and P_2, bounded by a rectangular bounding box B (Figure 11(a)). Let the maximum wire density of the vertical (horizontal) boundary bins of B on the vertical (horizontal) layer be WD_BVb (WD_BHb), and let the wire density of region B on the horizontal (vertical) layer be WD_BHr (WD_BVr). The possible maximum wire density of VHV (HVH) routing is then the maximum of WD_BVb (WD_BHb) and WD_BHr (WD_BVr). The VHV routing pattern (Figure 11(b)) is chosen if its possible maximum wire density is smaller than that of HVH; otherwise, the HVH routing pattern (Figure 11(c)) is chosen. Assuming VHV routing is used, the LZ-router recursively makes a horizontal cut on B and selects the half with the smaller wire density to route. It stops when the choice narrows to a single row.
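The VHV/HVH decision reduces to a comparison of two maxima, as in this illustrative sketch (the wire-density values are assumed inputs):

    def choose_pattern(wd_Vb, wd_Hb, wd_Hr, wd_Vr):
        # wd_Vb / wd_Hb: max wire density of B's vertical / horizontal
        # boundary bins on the vertical / horizontal layer; wd_Hr / wd_Vr:
        # wire density of region B on the horizontal / vertical layer.
        vhv_cost = max(wd_Vb, wd_Hr)   # possible max wire density of VHV
        hvh_cost = max(wd_Hb, wd_Vr)   # possible max wire density of HVH
        return "VHV" if vhv_cost < hvh_cost else "HVH"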


Fig. 11. An illustration of HVH and VHV routing selection of LZ-router.

Fig. 12. An illustration of IncA-tree algorithm.

Given a g_x × g_y bin structure, the complexity for the LZ-router to route two pins with coordinates (i, j) and (i + x, j + y) is O(log(|x| + |y|) log(g_x + g_y)).

The incremental A-tree (IncA-tree) algorithm is developed to efficiently update the routing topology for each pin location change. Given a grid structure consisting of (2m + 1) × (2m + 1) grids on the first quadrant, it can be recursively quadri-partitioned until each partition becomes the unit grid. For example, the grid structure in Figure 12 is first quadri-partitioned by the cut lines x = 4 and y = 4 to form four partitions. If some pins are located inside a partition (including locations on the bottom and left boundaries, but excluding locations on the right and top boundaries), the lower-left corner of the partition is the root of a subtree connecting all the pins inside this partition. For example, (4, 4) is the root for any pin at location (x, y) with 4 ≤ x < 8 and 4 ≤ y < 8. For a partition with pins inside, its root has an edge connecting it to the lower-left corner of the previous-level quadri-partition. In the above example, (4, 4) has an edge connecting to (0, 0). By recursively performing such quadri-partitioning, an A-tree can be built such that each pin at location (x, y) connects to the origin (0, 0) with max(log x, log y) edges. Any pin insertion (deletion) at location (x, y) incurs at most log(x + y) edge insertions (deletions). Therefore, each operation of moving a pin from (x_1, y_1) to (x_2, y_2) incurs at most log(x_1 + y_1) + log(x_2 + y_2) edge changes.
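The chain of subtree roots that links a pin to the origin can be sketched as follows; this illustrates the quadri-partition structure only, not the incremental update machinery itself.

    def inca_tree_path(x, y):
        # Chain of subtree roots linking pin (x, y) on the first quadrant to
        # the origin: at cell size s, the root of the quadri-partition cell
        # containing a node is that cell's lower-left corner.
        path = [(x, y)]
        s = 1
        while path[-1] != (0, 0):
            cx, cy = path[-1]
            node = (cx // s * s, cy // s * s)   # lower-left corner at size s
            if node != path[-1]:
                path.append(node)
            s *= 2
        return path

    # Example: (5, 6) -> (4, 6) -> (4, 4) -> (0, 0); the pin reaches the
    # origin with on the order of max(log x, log y) edges.
    print(inca_tree_path(5, 6))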


With the fast two-pin routing and incremental A-tree routing, for an n-pin net with bounding box length L on a g_x × g_y bin structure, the complexity of updating a non-root pin move is O(log L) times the complexity of the LZ-route, O(log L log(g_x + g_y)), which is O((log L)^2 log(g_x + g_y)). For moving the root, the complexity is O(n (log L)^2 log(g_x + g_y)). While providing superior guidance for congestion optimization during the coarse placement, the runtime overhead of this congestion-cost updating grows slowly due to the low logarithmic complexity. The IncA-tree may generate routes with longer wirelength than the A-tree does, and using it may overestimate the congestion. However, it is never intended to be used as the final measurement of the placement congestion. Instead, it is used to guide the placement optimization.

5.2 Optimization Techniques

After routability is modeled, a routing-congestion picture is obtained on the global-routing grid structure. Basically, there are two ways to apply the modeling results to the placement optimization process: net weighting and cell weighting (cell inflation). Net weighting directly transfers a congestion picture into net weights and optimizes weighted wirelength. It can easily be incorporated into iterative placement algorithms such as simulated-annealing-based methods [Hu and Marek-Sadowska 2002; Chang et al. 2003a]. Cell weighting (a.k.a. cell inflation) inflates cell sizes based on congestion estimation, so that cells in congested bins can be moved out of those bins after being inflated. It is more suitable for incorporation into constructive placement techniques, such as analytical placers [Parakh et al. 1998] and quadrisection-based placers [Brenner and Rohe 2003], as well as iterative placement techniques, such as simulated-annealing-based placers [Yang et al. 2003].

Example: Routability Control in mPL-R

As a concrete example of routability optimization in global placement, techniques recently developed for mPL (Section 3.2) are described [Li et al. 2004]. This work consists of both demand-driven congestion reduction and supply-driven white-space allocation. Both of these components are defined in terms of the estimated routing overflow of each bin of a uniform rectilinear grid, that is, the amount by which routing demand exceeds routing supply in that bin (Section 5.1.2).

Demand-Based Congestion Control via Topology-Based Weighted Wirelength. To reduce routing demand, the wirelength-driven placement of each subset of cells is supplemented by a routability-driven step. Immediately after the cells in a given subset are placed to minimize wirelength, they are moved again so as to reduce estimated routing congestion in the subregions they occupy. The changes in congestion are computed by explicit updates to the estimated routing topology. A secondary objective during this process is still to reduce the wirelength. Candidate cells for re-placement are selected based on the routing topology of the nets incident on them.


Fig. 13. Congestion-driven cell re-placement. The legend corresponds to the congestion in different routing regions. (a) Original placement of cell c in the optimal location for half-perimeter wirelength gives a weighted wirelength of 8.8. (b) Re-placement of c in a neighboring region gives a weighted wirelength of 6.2.

Nets are sorted in descending order according to the bin-capacity overflow they cause. The first s nets are picked, such that the total amount of routing resources they use exceeds the total overflow of the current placement. The cells connected by these nets are re-placed, as described next.

For each cell c to be re-placed, the grid cell gc_ij corresponding to its optimal bin location ℓ* = (i*, j*) for half-perimeter wirelength is determined by holding all other cells fixed. Then the cell is placed in each of the neighboring bins of ℓ* within a certain distance d, that is, {b_ij : |i* − i| + |j* − j| ≤ d}. Each time c is placed in a bin, the topology of the nets incident on it is recomputed using the LZ-router described in Section 5.1.2. The new placement for c is evaluated using a weighted wirelength over all the nets incident on c:

    WL_c = Σ_k WGT_net_k × WL_net_k,                                    (8)

where WGT_net_k is the weight of net_k, calculated as the average congestion of the grid cells net_k crosses, and WL_net_k is the half-perimeter wirelength of net_k. Finally, c is placed in the bin that results in the shortest weighted wirelength. Figure 13 shows an example. Starting from the optimal location ℓ* for half-perimeter wirelength, the weighted wirelength (8) is evaluated at each bin in the neighborhood of ℓ*, and the bin that gives the shortest weighted wirelength is selected as c's final location. In this example, the original location ℓ* gives a weighted wirelength of 8.8, whereas the final location gives a smaller weighted wirelength of 6.2.

Detailed experiments show that mPL-R's topology-based method of routability optimization is extremely effective. Compared to purely wirelength-driven mPL, mPL-R alone reduces the number of overflowed global-routing bins by 83% over the complete set of IBM-Dragon Version 2 easy and hard benchmarks and leads to successful routing (by Cadence WarpRoute) of 14 out of 16 easy benchmarks and 9 out of 16 hard benchmarks. When combined with the white-space allocation method described next, successful routing of all 32 benchmarks is achieved. Routed wirelength for the combined flow is reduced by 11.6% on average and is shorter than that of all other leading tools.
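A sketch of this candidate-bin search follows; weighted_wl stands for an assumed callback that re-routes the nets incident on c (e.g., with an LZ-style router) and evaluates (8), and grid.valid is an assumed bounds check.

    def replace_cell(c, opt_bin, d, weighted_wl, grid):
        # Evaluate cell c in every bin within Manhattan distance d of its
        # half-perimeter-optimal bin l* = (i*, j*), keeping the bin that
        # minimizes the weighted wirelength (8) of the nets incident on c.
        i_star, j_star = opt_bin
        best_bin, best_cost = opt_bin, weighted_wl(c, opt_bin)
        for i in range(i_star - d, i_star + d + 1):
            for j in range(j_star - d, j_star + d + 1):
                if abs(i - i_star) + abs(j - j_star) > d or not grid.valid(i, j):
                    continue
                cost = weighted_wl(c, (i, j))
                if cost < best_cost:
                    best_bin, best_cost = (i, j), cost
        return best_bin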


Supply-Based Congestion Control via White-Space Allocation. In the global placement of the proposed flow, the amount of white space in a region may not accurately match its routing demand. To further reduce congestion, hierarchical white-space allocation is applied after global placement, as part of legalization and detailed placement. A slicing tree is constructed based on the geometric locations of all cells. The congestion level at each node of the tree is calculated, and cell positions are adjusted at each node in a top-to-bottom fashion in order to redistribute available white space to relieve congestion.

The slicing tree used in the proposed flow is similar to that in a partitioning-based global placement. However, cutlines are selected based on the geometric locations of the cells instead of the minimization of cut size. Cut directions are selected simply to keep the aspect ratios of the resulting subregions suitably bounded. Every node in the tree maintains its cut direction, cut location, congestion, and total cell area, as well as its cell list. After the initial slicing-tree construction, the congestion at each node in the tree is calculated from the bottom up. The congestion of a leaf node can be estimated by the total routing overflow of the grid cells contained in that leaf node. The congestion of an internal node is computed by a post-order traversal of the tree.

The cutlines of the slicing tree are then adjusted one at a time in top-down order. Each cutline is translated from the more congested of its two subregions toward the less congested one, so that the amounts of white space allocated to the sibling subregions are linearly proportional to their congestion levels. The movement of the cutline is expressed as a coordinate scaling, as described below. This scaling is applied to the cells of the subregions in order to redistribute them in a way that relieves congestion without changing their relative ordering within the parent subregion.

Consider a region r with lower-left corner (x_0, y_0), upper-right corner (x_1, y_1), and an original vertical cut at x_cut = (x_0 + x_1)/2. The area of this region is A_r = (x_1 − x_0)(y_1 − y_0). Assume that the total cell areas of the left subregion r_0 and the right subregion r_1 are S_0 and S_1, and that the corresponding congestion levels are u_0 and u_1, respectively. Of the total amount of white space, (A_r − S_0 − S_1), the amount allocated to subregion r_0 is (A_r − S_0 − S_1) · u_0/(u_0 + u_1). The new cutline location x′_cut can be derived as follows:

    γ = [S_0 + (A_r − S_0 − S_1) · u_0/(u_0 + u_1)] / A_r,
    x′_cut = γ x_1 + (1 − γ) x_0,

where γ is the ratio of the left subregion area to A_r after the cutline adjustment. As an output of this step, a global placement that contains overlaps is obtained. Moreover, cells may not be placed along a row. The cutline-adjustment approach is performed in the same spirit as fractional cut (Section 3.1.1), where horizontal cuts are not aligned with row boundaries. Figure 14(b) gives an example of this process. The total cell area and congestion at every node of the slicing tree are given. Cutlines are adjusted from top to bottom such that the white space in each subregion is proportional to its congestion level. After white-space allocation, a local-swapping-based detailed placement is applied to obtain a legal placement.
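The cutline arithmetic above in executable form; the inputs are the quantities just defined.

    def adjust_cutline(x0, y0, x1, y1, S0, S1, u0, u1):
        # New vertical cutline for region r: the white space (Ar - S0 - S1)
        # is split between the two subregions in proportion to their
        # congestion levels u0 and u1.
        Ar = (x1 - x0) * (y1 - y0)
        ws_left = (Ar - S0 - S1) * u0 / (u0 + u1)   # white space given to r0
        gamma = (S0 + ws_left) / Ar                 # left-area ratio after the move
        return gamma * x1 + (1 - gamma) * x0        # x'_cut

    # Example: unit-square region, equal cell areas, left side twice as
    # congested; the cut moves right of 0.5 to give r0 more white space.
    print(adjust_cutline(0.0, 0.0, 1.0, 1.0, 0.3, 0.3, 2.0, 1.0))  # ~0.567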


Fig. 14. (a) A slicing tree and its corresponding cutlines and regions. (b) The slicing tree after congestion estimation, and the regions after cutline adjustment.

This simple white-space allocation scheme is remarkably effective. When used alone as part of detailed placement, it consistently reduces both routed wirelength and the number of overflowed global-routing bins. When combined with the topology-based congestion control in mPL-R as described above, it leads to higher completion rates and shorter routed wirelengths than can be achieved by any other leading tool.

6. CONCLUSION

Algorithms for large-scale circuit placement play a vital role in today's interconnect-limited nanometer designs. Recent studies suggest that the potential exists for a full technology generation's worth of performance gains in the placement step alone. In this article, we have reviewed the current state of the art, from the basic paradigms for scalable wirelength-driven placement to techniques for performance and routability optimization. We believe that hierarchical/multilevel methods are needed for scalability, and weighted wirelength minimization provides a general framework for performance and routability optimization in placement.

Ideally, systematic empirical comparisons would be used to understand the trade-offs of the different algorithms summarized in this article. However, direct numerical comparisons of these algorithms are difficult, partly due to limited accessibility to these algorithms and partly due to differences in their assumptions. Recently, comparisons based on wirelength minimization have been
attempted [Adya et al. 2003]. We are not aware of any comprehensive quantitative comparison in terms of performance or routability optimization. More work is needed to build a common framework for direct comparisons of different placement methods.

REFERENCES

ADYA, S., CHATURVEDI, S., ROY, J., PAPA, D. A., AND MARKOV, I. 2004. Unification of partitioning, placement, and floorplanning. In Proceedings of the International Conference on Computer-Aided Design (Nov.). 550–557.
ADYA, S., YILDIZ, M., MARKOV, I., VILLARRUBIA, P., PARAKH, P., AND MADDEN, P. 2003. Benchmarking for large-scale placement and beyond. In Proceedings of the International Symposium on Physical Design. 95–103.
ADYA, S. N. AND MARKOV, I. L. 2002. Consistent placement of macro-blocks using floorplanning and standard-cell placement. In Proceedings of the International Symposium on Physical Design (Apr.). 12–17.
AGNIHOTRI, A. R., YILDIZ, M. C., KHATKHATE, A., MATHUR, A., ONO, S., AND MADDEN, P. H. 2003. Fractional cut: Improved recursive bisection placement. In Proceedings of the International Conference on Computer-Aided Design. 307–310.
ALPERT, C. J. 1998. The ISPD98 circuit benchmark suite. In Proceedings of the International Symposium on Physical Design. 85–90.
ARROW, K., HURWICZ, L., AND UZAWA, H. 1958. Studies in Nonlinear Programming. Stanford University Press, Stanford, Calif.
BETZ, V. AND ROSE, J. 1997. VPR: A new packing, placement, and routing tool for FPGA research. In Proceedings of the International Workshop on FPL. 213–222.
BRANDT, A. 1986. Algebraic multigrid theory: The symmetric case. Appl. Math. Comp. 19, 23–56.
BRANDT, A. AND RON, D. 2002. Multigrid solvers and multilevel optimization strategies. In Multilevel Optimization and VLSICAD. Kluwer Academic Publishers, Boston, Mass., Chap. 1.
BRENNER, U., PAULI, A., AND VYGEN, J. 2004. Almost optimum placement legalization by minimum cost flow and dynamic programming. In Proceedings of the International Symposium on Physical Design. 2–8.
BRENNER, U. AND ROHE, A. 2003. An effective congestion-driven placement framework. IEEE Trans. CAD 22, 4 (Apr.), 387–394.
BRENNER, U. AND ROHE, A. 2002. An effective congestion-driven placement framework. In Proceedings of the International Symposium on Physical Design (Apr.).
BREUER, M. 1977. Min-cut placement. J. Design Automat. Fault Tolerant Comput. 1, 4 (Oct.), 343–362.
BRIGGS, W., HENSON, V., AND MCCORMICK, S. 2000. A Multigrid Tutorial, 2nd ed. SIAM, Philadelphia, Pa.
BURSTEIN, M. AND YOUSSEF, M. N. 1985. Timing influenced layout design. In Proceedings of the ACM/IEEE Design Automation Conference. ACM, New York, 124–130.
CADENCE DESIGN SYSTEMS, INC. 1999. QPlace version 5.1.55, compiled on 10/25/1999. Envisia ultra placer reference.
CALDWELL, A., KAHNG, A. B., AND MARKOV, I. 2000a. Improved algorithms for hypergraph partitioning. In Proceedings of the Asia South Pacific Design Automation Conference.
CALDWELL, A., KAHNG, A., AND MARKOV, I. 2000b. Can recursive bisection produce routable placements? In Proceedings of the 37th IEEE/ACM Design Automation Conference. ACM, New York, 477–482.
CALDWELL, A., KAHNG, A., AND MARKOV, I. 2000c. Iterative partitioning with varying node weights. VLSI Design 11, 3, 249–258.
CALDWELL, A., KAHNG, A., AND MARKOV, I. 2000d. Optimal partitioners and end-case placers for standard-cell layout. IEEE Trans. CAD 19, 11, 1304–1314.
CHAN, T., CONG, J., KONG, T., AND SHINNERL, J. 2003a. Multilevel circuit placement. In Multilevel Optimization in VLSICAD. Kluwer, Boston, Mass., Chap. 4.
CHAN, T., CONG, J., KONG, T., AND SHINNERL, J. 2000. Multilevel optimization for large-scale circuit placement. In Proceedings of the IEEE International Conference on Computer-Aided Design (San Jose, Calif.). IEEE Computer Society Press, Los Alamitos, Calif., 171–176.
CHAN, T., CONG, J., KONG, T., SHINNERL, J., AND SZE, K. 2003b. An enhanced multilevel algorithm for circuit placement. In Proceedings of the IEEE International Conference on Computer-Aided Design (San Jose, Calif.). IEEE Computer Society Press, Los Alamitos, Calif.
CHAN, T., CONG, J., AND SZE, K. 2005. Multilevel generalized force-directed method for circuit placement. In Proceedings of the International Symposium on Physical Design.
CHANG, C.-C., CONG, J., PAN, D., AND YUAN, X. 2003a. Multilevel global placement with congestion control. IEEE Trans. CAD 22, 4 (Apr.), 395–409.
CHANG, C. C., CONG, J., AND XIE, M. 2003b. Optimality and scalability study of existing placement algorithms. In Proceedings of the Asia South Pacific Design Automation Conference. 621–627.
CHANG, C.-C., LEE, J., STABENFELDT, M., AND TSAY, R. S. 1994. A practical all-path timing-driven place and route design system. In Proceedings of the Asia-Pacific Conference on Circuits and Systems. 560–563.
CHEN, C., YANG, X., AND SARRAFZADEH, M. 2000. Potential slack: An effective metric of combinational circuit performance. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design. ACM, New York, 198–201.
CHEN, H., CHENG, C.-K., CHOU, N.-C., KAHNG, A., MACDONALD, J., SUARIS, P., YAO, B., AND ZHU, Z. 2003. An algebraic multigrid solver for analytical placement with layout-based clustering. In Proceedings of the IEEE/ACM Design Automation Conference. ACM, New York, 794–799.
CHENG, C. AND KUH, E. 1984. Module placement based on resistive network optimization. IEEE Trans. CAD CAD-3, 3.
CHENG, C.-L. E. 1994. RISA: Accurate and efficient placement routability modeling. In Proceedings of the International Conference on Computer-Aided Design. 690–695.
CHU, C. AND VISWANATHAN, N. 2004. FastPlace: Efficient analytical placement using cell shifting, iterative local refinement, and a hybrid net model. In Proceedings of the International Symposium on Physical Design (Apr.). 26–33.
CONG, J. 2001. An interconnect-centric design flow for nanometer technologies. Proc. IEEE 89, 4 (Apr.), 505–527.
CONG, J. AND LIM, S. 2000. Edge separability based circuit clustering with application to circuit partitioning. In Proceedings of the Asia South Pacific Design Automation Conference (Yokohama, Japan). 429–434.
CONG, J., ROMESIS, M., AND SHINNERL, J. 2005a. Fast floorplanning by look-ahead enabled recursive bipartitioning. In Proceedings of the Asia South Pacific Design Automation Conference.
CONG, J., ROMESIS, M., AND SHINNERL, J. 2005b. Robust mixed-size placement by recursive legalized bipartitioning. Report 040057, Computer Science Dept., University of California, Los Angeles, Calif. ftp://ftp.cs.ucla.edu/tech-report/2005-reports/040057.pdf.
CONG, J., ROMESIS, M., AND XIE, M. 2003a. Optimality and stability of timing-driven placement algorithms. In Proceedings of the IEEE International Conference on Computer-Aided Design (San Jose, Calif., Nov.). IEEE Computer Society Press, Los Alamitos, Calif.
CONG, J., ROMESIS, M., AND XIE, M. 2003b. Optimality, scalability and stability study of partitioning and placement algorithms. In Proceedings of the International Symposium on Physical Design. 88–94.
CONG, J., ROMESIS, M., AND XIE, M. 2004. UCLA Optimality Study Project. http://cadlab.cs.ucla.edu/~pubbench.
DUNLOP, A. AND KERNIGHAN, B. 1985. A procedure for placement of standard-cell VLSI circuits. IEEE Trans. CAD CAD-4, 1 (Jan.).
DUNLOP, A. E., AGRAWAL, V. D., DEUTSCH, D. N., JUKL, M. F., KOZAK, P., AND WIESEL, M. 1984. Chip layout optimization using critical path weighting. In Proceedings of the ACM/IEEE Design Automation Conference. ACM, New York, 133–136.
EISENMANN, H. AND JOHANNES, F. 1998. Generic global placement and floorplanning. In Proceedings of the 35th ACM/IEEE Design Automation Conference. ACM, New York, 269–274.
EVANS, L. C. 2002. Partial Differential Equations. American Mathematical Society, Providence, R.I.
A novel net weighting algorithm for timing-driven placement. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design. ACM, New York, 172–176. LI, C. AND KOH, C.-K. 2003. On improving recursive bipartitioning-based placement. Tech. Rep. TR-ECE 03-14, Purdue University. LI, C., XIE, M., KOH, C., CONG, J., AND MADDEN, P. 2004. Routability-driven placement and white space allocation. In Proceedings of the International Conference on Computer-Aided Design. 394– 401. LOU, J., THAKUR, S., KRISHNAMOORTHY, S., AND SHENG, H. 2002. Estimating routing congestion using probabilistic analysis. IEEE Trans. CAD 21, 1 (Jan.), 32–41. ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005.

Large-Scale Circuit Placement



429

LUK, W. K. 1991. A fast physical constraint generator for timing driven layout. In Proceedings of the ACM/IEEE Design Automation Conference. ACM, New York, 626–631. MAREK-SADOWSKA, M. AND LIN, S. P. 1989. Timing driven placement. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design. ACM, New York 94–97. MARQUARDT, A., BETZ, V., AND ROSE, J. 2000. Timing-driven placement for FPGAs. In Proceedings of the ACM Symposium on FPGAs. ACM, New York, 203–213. MAYRHOFER, S. AND LAUTHER, U. 1990. Congestion-driven placement using a new multipartitioning heuristic. In Proceedings of the International Conference on Computer-Aided Design. 332–335. MORTON, K. W. AND MAYERS, D. F. 1994. Numerical Solution of Partial Differential Equations. Cambridge University Press. NAIR, R., BERMAN, C. L., HAUGE, P., AND YOFFA, E. J. 1989. Generation of performance constraints for layout. IEEE Trans. CAD Integ. Circ. Syst. 8, 8, 860–874. NAYLOR, W. C., DONELLY, R., AND SHA, L. 2001. Nonlinear optimization system and method for wire length and delay optimization for an automatic electric circuit placer. ONO, S. AND MADDEN, P. 2005. On structure and suboptimality in placement. In Proceedings of the Asia South Pacific Design Automation Conference. PARAKH, P. N., BROWN, R. B., AND SAKALLAH, K. A. 1998. Congestion driven quadratic placement. In Proceedings of the Design Automation Conference. 275–278. QUINN, N. AND BREUER, M. 1979. A force-directed component placement procedure for printed circuit boards. IEEE Trans. Circ Syst CAS CAS-26, 377–388. RAMACHANDARAN, P., ONO, S., AGNIHOTRI, A., DAMODARA, P., SRIHARI, H., AND MADDEN, P. 2005. Optimal placement by branch-and-price. In Proceedings of the Asia South Pacific Design Automation Conference. (Jan.). REN, H., PAN, D., AND KUNG, D. 2004. Sensitivity guided net weighting for placement driven synthesis. In Proceedings of the International Symposium on Physical Design. 10–17. SAAD, Y. 1996. Iterative Methods for Sparse Linear Systems. PWS publishing, Pacific Grove, Calif. SANKAR, Y. AND ROSE, J. 1999. Trading quality for compile time: Ultra-fast placement for FPGAs. In FPGA ‘99, ACM Symposium on FPGAs. ACM, New York, 157–166. SARRAFZADEH, M., KNOL, D. A., AND TELLEZ, G. E. 1997a. A delay budgeting algorithm ensuring maximum flexibility in placement. IEEE Trans. CAD of Integ. Circ. Syst. 16, 11, 1332–1341. SARRAFZADEH, M., KNOL, D. A., AND TELLEZ, G. E. 1997b. Unification of budgeting and placement. In Proceedings of the ACM/IEEE Design Automation Conference. ACM, New York, 758– 761. SARRAFZADEH, M., WANG, M., AND YANG, X. 2002. Modern Placement Techniques. Kluwer, Boston, Mass. SENN, M., SEIDL, U., AND JOHANNES, F. 2002. High quality deterministic timing driven FPGA placement. In Proceedings of the ACM Symposium on FPGAs. ACM, New York. SIGL, G., DOLL, K., AND JOHANNES, F. 1991. Analytical placement: A linear or a quadratic objective function? In Proceedings of the 28th ACM/IEEE Design Automation Conference. ACM, New York, 427–432. SRINIVASAN, A., CHAUDHARY, K., AND KUH, E. S. 1991. RITUAL: A performance driven placement for small-cell ICs. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design. ACM, New York, 48–51. SWARTZ, W. AND SECHEN, C. 1995. Timing-driven placement for large standard cell circuits. In Proceedings of the ACM/IEEE Design Automation Conference. ACM, New York, 211–215. TELLEZ, G. E., KNOL, D. A., AND SARRAFZADEH, M. 1996. A performance-driven placement technique based on a new net budgeting criterion. 
In Proceedings of the International Symposium on Circuits and Systems. 504–507. TSAY, R., KUH, E., AND HSU, C. 1988. Proud: A fast sea-of-gates placement algorithm. IEEE Desi. Test Comput. 44–56. TSAY, R. S. AND KOEHL, J. 1991. An analytic net weighting approach for performance optimization in circuit placement. In Proceedings of the ACM/IEEE Design Automation Conference. ACM, New York, 620–625. VORWERK, K., KENNINGS, A., AND VANNELLI, A. 2004. Engineering details of a stable force-directed placer. In Proceedings of the International Conference on Computer-Aided Design. 573–580. ACM Transactions on Design Automation of Electronic Systems, Vol. 10, No. 2, April 2005.

430



J. Cong et al.

VYGEN, J. 1997. Algorithms for large-scale flat placement. In Proceedings of the 34th ACM/IEEE Design Automation Conference. ACM, New York, 746–751.
VYGEN, J. 2000. Four-way partitioning of two-dimensional sets. Report 00900-OR, Research Institute for Discrete Mathematics, University of Bonn, Bonn, Germany.
WANG, Q., JARIWALA, D., AND LILLIS, J. 2005. A study of tighter lower bounds in LP relaxation-based placement. In Proceedings of the Great Lakes Symposium on VLSI. To appear.
WANG, M., YANG, X., AND SARRAFZADEH, M. 2000. Dragon2000: Standard-cell placement tool for large circuits. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design. ACM, New York, 260–263.
WESTRA, J., BARTELS, C., AND GROENEVELD, P. 2004. Probabilistic congestion prediction. In Proceedings of the International Symposium on Physical Design. 204–209.
XIU, Z., MA, J., FOWLER, S., AND RUTENBAR, R. 2004. Large-scale placement by grid warping. In Proceedings of the Design Automation Conference. 351–356.
YANG, X., CHOI, B., AND SARRAFZADEH, M. 2002a. Timing-driven placement using design hierarchy guided constraint generation. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design. ACM, New York, 177–180.
YANG, X., CHOI, B., AND SARRAFZADEH, M. 2003. Routability-driven white space allocation for fixed-die standard-cell placement. IEEE Trans. CAD 22, 4 (Apr.), 410–419.
YANG, X., KASTNER, R., AND SARRAFZADEH, M. 2002b. Congestion estimation during top-down placement. IEEE Trans. CAD 21, 1 (Jan.), 72–80.
YILDIZ, M. AND MADDEN, P. 2001a. Global objectives for standard cell placement. In Proceedings of the 11th Great Lakes Symposium on VLSI. 68–72.
YILDIZ, M. AND MADDEN, P. 2001b. Improved cut sequences for partitioning-based placement. In Proceedings of the Design Automation Conference. 776–779.
YOUSSEF, H., LIN, R. B., AND SHRAGOWITZ, E. 1992. Bounds on net delays. IEEE Trans. Circ. Syst. 39, 11, 815–824.
YOUSSEF, H. AND SHRAGOWITZ, E. 1990. Timing constraints for correct performance. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design. ACM, New York, 24–27.

Received October 2004; revised January 2005; accepted January 2005
