A Method for Mapping Unimodular Loops into Application Specific Parallel Structures

(Published in 2nd Int. Conf. "Parallel Processing and Applied Mathematics", PPAM'97, Zakopane, Poland, Sept. 2-5, 1997, pp. 362-371)

A METHOD FOR MAPPING UNIMODULAR LOOPS INTO APPLICATION SPECIFIC SIGNAL PROCESSORS

A.M. Sergyienko*, Guzinski A.**, Kaniewski J.S.**
*Department of Computer Science, National Technical University of Ukraine, KPI-2020, Pr. Pobedy, 37, 252056, Kiev, Ukraine, E-mail: kanevski@comsys.kpi.ua
**Institute of Mathematics & Computer Science, Technical University of Koszalin, Koszalin, Poland, E-mail: kanievsk@tu.koszalin.pl

ABSTRACT
A method for mapping unimodular loop nests into application specific structures is presented. The method consists in representing the reduced dependence graph of the algorithm in a multidimensional index space and in mapping this graph into a processor subspace and an event subspace. Some restrictions imposed on the reduced dependence graph help to simplify the mapping process and to obtain pipelined processing units. An example of IIR-filter structure synthesis illustrates the mapping process.
Keywords: algorithm mapping, application specific circuits, digital signal processing.

1. Introduction
The automatic development of ASICs for digital signal processing (DSP) helps to shorten both the way from the idea to the market and the development costs. A silicon compiler can provide a direct path from a DSP algorithm to the chip which computes this algorithm, and the design period is then determined first of all by technological constraints [1]. The use of programmable devices, such as Field Programmable Gate Arrays (FPGAs), can provide hardware prototypes with minimum fabrication delay [2].

Such steps of the design process as testing of the signal processing algorithm, logic design and verification, routing, and translation of the circuit into the FPGA programming format are now automated. But the development of the structure which realizes a given signal processing algorithm is still done by hand, and skilled specialists are needed for this job [2]. Therefore the development of programming tools for mapping DSP algorithms into structures adapted to the properties of FPGAs is of great importance.

DSP algorithms usually process a data flow in real time and therefore have an iterative nature. We will consider DSP algorithms which are represented by unimodular loop nests or regular recurrent equations. The kernel of the loop nest has one or more statements of the form:

Sti: a[I] = f(a[I + D1], b[I + D2], …),

where I is the index vector of the variables, which represents a point in the iteration space, and Dj is the vector of increments to the index of the j-th variable, which characterizes the data dependence between the iterations I + Dj and I.

This means that all computations which belong to a single iteration can be scheduled in such a way that they begin at a single moment of time [5]. There are well known methods for mapping such algorithms into systolic array structures (see, for example, [3-8]). These methods are based on an affine transform of the iteration space Zn, I ∈ Zn, with the matrix P, into the subspace Zm of structures and the subspace Zn-m of events. As a result of the transform, the statement Sti of the iteration I is processed in the processing unit (PU) with coordinates KS = PS·I at the time step KT = PT·I, where KS ∈ Zm, KT ∈ Zn-m, and P = (PS^T, PT^T)^T.
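
To make this mapping concrete, the following small Python sketch (an illustration added here, not part of the original method description) applies a hypothetical unimodular matrix P, split into PS and PT, to the points of a small two-dimensional iteration space; the matrix and the loop bounds are assumptions chosen only for the example.

# Minimal sketch of the affine space-time mapping K = P*I: the iteration point I
# is mapped to PU coordinates K_S = P_S*I and to a time step K_T = P_T*I.
# The matrix P and the loop bounds are hypothetical, chosen only for illustration.
import numpy as np

P_S = np.array([[1, 0]])            # assumed processor part of P (m = 1)
P_T = np.array([[1, 1]])            # assumed event (time) part of P (n - m = 1)
P = np.vstack([P_S, P_T])           # P = (P_S^T, P_T^T)^T, here unimodular

assert abs(round(np.linalg.det(P))) == 1   # unimodularity check

for i in range(1, 4):
    for j in range(1, 4):
        I = np.array([i, j])        # a point of the iteration space Z^2
        K_S = (P_S @ I)[0]          # coordinate of the PU which executes (i, j)
        K_T = (P_T @ I)[0]          # time step at which (i, j) is executed
        print(f"iteration ({i}, {j}) -> PU {K_S}, time step {K_T}")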

If the algorithm has cycles of dependencies between iterations, which are expressed by a cyclic reduced dependence graph, then mapping such an algorithm is more complex [5]. Methods for mapping these algorithms are known which are based on mapping each statement Sti with a separate affine mapping function [8, 9]. The search for an algorithm mapping is then implemented by optimizing a system of inequalities which express the restrictions on the affine mapping functions. Solving this problem can give the optimal solution, but the solution process is rather hard [9].

The methods mentioned above have a set of restrictions which do not permit their direct use for the development of DSP structures. First of all, it is assumed that the statements Sti which belong to a single iteration must be processed simultaneously during a single time step. Therefore, although the systolic array represents a multidimensional pipelined computer system, separate complex operators and statements cannot be computed in a pipelined manner. The second restriction is that the mapping result represents a structure not with a given throughput, but with the maximum throughput. This restriction is somewhat relaxed by the synthesis of fixed size systolic arrays, but the problem of single time step processing of complex operators remains [8, 10, 11].

The use of pipelined PUs offers increased throughput of the DSP processor due to the possibility of beginning the processing of the next operator before completing the previous one. Therefore, the development of pipelined PUs is attractive for the hardware realization of many algorithms, among them DSP algorithms. In [12] a method for systolic array design with pipelined PUs is proposed. But this method is not suitable here because it consists in the manual introduction of pipeline stages into a given systolic array structure.

This work deals with a new method for designing application specific DSP processor structures by mapping algorithms which are given as unimodular loops.

2. Assumed algorithms and goals of the method.
The proposed method is a modification of the known methods for the structured synthesis of systolic arrays. The goals of the modifications are the following:

  • processing operators for more than a single cycle of time. This provides the design of DSP processors with a given throughput and the computation of complex operators of the algorithm, where the operators can have different complexity. Cyclic dependencies in the algorithm are allowed too. Different statements Sti of the loop kernel can start their processing at different clock cycles, and this enlarges the class of processed algorithms;
  • internal pipelining of PUs. By pipelining the PUs internally, the latency of a PU can become more than one cycle of time, but the PU has a higher throughput because the maximum allowable clock frequency is higher;
  • hardware sharing, which means that the same hardware unit executes similar statements in sequential order, unlike the known methods, where each unit executes a single statement (see the sketch after this list).
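
The following small Python sketch (an illustration only, with assumed pipeline depth and statement count) shows the combined effect of the last two goals: several statements of one iteration are issued to a single internally pipelined PU one clock cycle apart, so the pipeline depth delays only the results, not the issue rate.

# Illustration of hardware sharing with an internally pipelined PU:
# statements are issued to the shared PU one per clock cycle, so only the
# results are delayed by the pipeline depth. The numbers are assumptions.
PIPELINE_DEPTH = 3    # assumed PU latency (pipeline stages), in clock cycles
N_STATEMENTS = 4      # assumed number of statements mapped onto this PU

for j in range(N_STATEMENTS):
    issue = j                         # one statement enters the PU per cycle
    ready = issue + PIPELINE_DEPTH    # its result appears PIPELINE_DEPTH cycles later
    print(f"St{j + 1}: issued at cycle {issue}, result ready at cycle {ready}")

# A non-pipelined PU would need N_STATEMENTS * PIPELINE_DEPTH = 12 cycles in total,
# while the pipelined one has its last result ready at cycle
# N_STATEMENTS + PIPELINE_DEPTH - 1 = 6.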

Consider an algorithm which is represented by a single loop:

for i = 1, Ui do
 (y1(i),...,yp(i)) = f(x1(i+di1),...,yq(i+diq))
 end.

      (1)

Here the operator f is computed by an algorithm which consists of Uj unary and binary statements Stj, without any conditional statements. Therefore the algorithm can be represented as follows:

for i = 1,Ui do 
 {statement St1}
 . . .
 Stj: y[i,j]=ϕj,k(y[i-di1,j],y[i-di2,j])
 . . .
 {statement StUj}
 end,

      (2)

where ϕj,k(x,y) is the operator of the k-th type which processes the operands x and y. This loop can be transformed into a triple loop nest. In the (i, j, k)-th iteration of such a loop nest either the j-th statement of the k-th type is processed or nothing is done:

for i = 1,Ui do
 for j = 1,Uj do
   for k = 1,Uk do 
    if (j,k)∈Φ then y[i,j]=ϕj,k(y[i-di1,j],y[i-di2,j])
   end
 end
end,

      (3)

where Φ is the set of feasible couples (j, k), which specify the type and the order of the operators implemented in the algorithm (2).

Therefore, the loop (2) whose kernel consists of several different statements can be represented as a triple loop nest (3). The computation of this loop nest takes place in the three-dimensional iteration space K3 = {1 ≤ i ≤ Ui, 1 ≤ j ≤ Uj, 1 ≤ k ≤ Uk} ⊂ Z3. Each operator is represented by a vector Ki ∈ K3, and the dependence between two operators Ki, Kl is represented by the dependence vector Dj = Kl − Ki. In most cases the vector Dj represents a variable which is a result of the operator Ki and is transferred to another operator Kl as an input variable. A generalized loop nest with such a kernel can be represented in the same manner.
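
As an illustration (not taken from the paper, with a hypothetical two-statement kernel), the following Python sketch places statements at points (i, j, k) of the space K3, enumerates the iterations selected by the set Φ, and computes a dependence vector as the difference Kl − Ki.

# Sketch of the representation in the 3-D index space of loop nest (3):
# a statement St_j of operator type k executed in iteration i occupies the
# point (i, j, k); dependencies are differences of such points.
# The kernel (two statements, two operator types) is hypothetical.
import numpy as np

U_i, U_j, U_k = 4, 2, 2                  # assumed loop bounds
PHI = {(1, 1), (2, 2)}                   # assumed feasible couples (j, k)

def K(i, j, k):
    # point of the iteration space K3 for statement j of type k in iteration i
    return np.array([i, j, k])

# statement St1 (type 1) of iteration i produces a value consumed by
# statement St2 (type 2) of iteration i + 1 (an assumed dependence)
K_i = K(1, 1, 1)
K_l = K(2, 2, 2)
D = K_l - K_i                            # dependence vector D_j = K_l - K_i
print("dependence vector:", D)           # -> [1 1 1]

# iterations of (3) in which some statement is actually executed
active = [(i, j, k) for i in range(1, U_i + 1)
                    for j in range(1, U_j + 1)
                    for k in range(1, U_k + 1) if (j, k) in PHI]
print(len(active), "active iterations out of", U_i * U_j * U_k)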

The above mentioned methods of application specific structure synthesis suppose that the PUs implement a given set of operators. In this paper application specific PUs are considered which implement a single operator ϕk. A set of PUs for DSP applications can consist of simple PUs such as an adder, a multiplier, a ROM, and their storage unit can be a FIFO, which in most cases can consist of a single result register.

3. Mapping unimodular loops into the application specific processor structure.
In the methods described in [3-8, 10-12], the graph GA of the algorithm is represented in the n-dimensional index space Zn. The graph GA of a systolic algorithm is a regular lattice, and therefore it is represented by its compact form, which consists of the unequal dependence vectors Dj and the processing domain Kn ⊂ Zn. When the algorithm has a complex loop kernel, as in the algorithm (2), then a reduced dependence graph GAR can represent its compact form. This directed graph, which is in general cyclic, has N nodes of operators Ki and M edges of dependencies Dj.

Consider a simple example of an algorithm:

 for i = 1, N do
     for j = 1, M do
     St1: a[i,j] = b[i-1,j-1];
     St2: b[i,j] = a[i,j];
     end
 end

This algorithm is represented by the reduced dependence graph GAR which is shown in Fig. 1.

Fig. 1. The reduced dependence graph GAR.

The vector-edges D1 and D2 represent the movement of the data a and b between the statements St1 and St2, and are weighted with the distance vectors (0, 0)^T and (1, 1)^T respectively.
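
These weights can be read directly from the index expressions of the two statements, as the following small Python check (added here only as an illustration) shows.

# The distance vectors of the example: a dependence edge carries the difference
# between the iteration that writes a value and the iteration that reads it.
import numpy as np

write_a = np.array([0, 0])     # St1 writes a[i, j]
read_a  = np.array([0, 0])     # St2 reads  a[i, j]         (same iteration)
write_b = np.array([0, 0])     # St2 writes b[i, j]
read_b  = np.array([-1, -1])   # St1 reads  b[i-1, j-1]     (previous i and j)

D1 = write_a - read_a          # edge St1 -> St2, weight (0, 0)^T
D2 = write_b - read_b          # edge St2 -> St1, weight (1, 1)^T
print("D1 =", D1, " D2 =", D2)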

The reduced dependence graph GAR can be represented in the n-dimensional space by the matrix D of data dependence vectors Dj, the matrix K of vector-nodes Ki, and the incidence matrix A of this graph. Then the matrices K, D, A form an algorithm configuration CA.

The following definitions and dependencies hold for configurations CA. The configuration CA is correct if Ki ≠ Kj for all i, j = 1,…,N, i ≠ j, and if there is a linear dependence between the configuration matrices, i.e.

D = K·A;  K = D0·A0^(-1),       (4)

where A0 is the incidence matrix of the maximal spanning tree of GAR, and D0 is the matrix of the vector-arcs of this tree. For example, for the graph in Fig. 1 the following equation holds:

where DB is a basis vector-edge which connects the origin of the space with the vector-node K1.

The sum of the vector-edges Dj which belong to any loop of the graph GAR must be equal to zero, i.e. for the i-th loop the following equation is true:

Σj bij·Dj = 0,

where bij is the element of the i-th row of the cyclomatic matrix of the graph GAR. Configurations CA1 = (K1, D1, A1) and CA2 = (K2, D2, A2) are equivalent if they are correct and represent the same algorithm graph, i.e. A1 = A2.
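
The following Python sketch (an illustration; the coordinates of the vector-nodes are a hypothetical placement, since the actual one is defined by Fig. 1) checks the relations (4) and the zero-sum condition over the single cycle of the example graph.

# Check of the configuration relations for the two-node graph of Fig. 1:
# D = K*A, K = D0*A0^(-1), and the zero sum of vector-edges over the cycle.
# The node coordinates are a hypothetical placement chosen for illustration.
import numpy as np

K = np.array([[1, 2],
              [0, 1]])              # columns: K1 = (1, 0)^T, K2 = (2, 1)^T (assumed)

# incidence matrix A: edge D1 is St1 -> St2, edge D2 is St2 -> St1
# (-1 at the tail node, +1 at the head node)
A = np.array([[-1,  1],
              [ 1, -1]])

D = K @ A                           # equation (4): D1 = K2 - K1, D2 = K1 - K2
print("D =\n", D)

# spanning tree of the graph: edge D1, plus the basis edge DB from the origin to K1
D0 = np.array([[1, 1],
               [0, 1]])             # columns: DB = K1, D1 = K2 - K1
A0 = np.array([[1, -1],
               [0,  1]])            # incidence of DB and D1 over the nodes K1, K2
assert np.allclose(K, D0 @ np.linalg.inv(A0))       # second relation of (4)

# the single cycle traverses D1 and D2 once each (cyclomatic row b = (1, 1))
b = np.array([1, 1])
assert np.array_equal(D @ b, np.zeros(2, dtype=int))  # sum over the cycle is zero
print("cycle sum:", D @ b)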

The following theorem is used to implement the equivalent transformations of configurations.

The correct configuration CA1 is equivalent to the configuration CA2 iff A1 = A2 and K2 = F(K1), where F is an injective function. For example, the following transformations give equivalent configurations: permutations of the vectors Ki in the space Zn, permutations of the columns of the matrix K1, and multiplication of the matrix K1 by a non-singular matrix P.
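
As a small illustration of the last transformation (continuing the hypothetical coordinates used above), multiplying the matrix of vector-nodes by a non-singular matrix P leaves the incidence matrix unchanged and transforms the vector-edges consistently, so the resulting configuration is equivalent.

# Equivalent transformation of a configuration: K2 = P*K1 with a non-singular P
# keeps the incidence matrix A, and the vector-edges transform as D2 = P*D1.
# The matrices continue the hypothetical example used above.
import numpy as np

K1 = np.array([[1, 2],
               [0, 1]])             # assumed vector-nodes of configuration CA1
A  = np.array([[-1,  1],
               [ 1, -1]])           # incidence matrix of the graph of Fig. 1
D1 = K1 @ A

P  = np.array([[1, 1],
               [0, 1]])             # hypothetical non-singular (unimodular) matrix
K2 = P @ K1                         # F(K) = P*K is injective for non-singular P
D2 = K2 @ A

assert np.array_equal(D2, P @ D1)   # edges transform consistently with the nodes
print("K2 =\n", K2, "\nD2 =\n", D2)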

The graph GS of the processor structure is represented by its configuration CS = (KS, DS, A), where KS is the matrix of the vector-nodes KSi ∈ Zm which give the coordinates of the PUs, and DS is the matrix of the vector-arcs DSj ∈ Zm which represent the connections between the PUs, m < n.
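
A possible way to read off such a structure configuration, sketched below under the assumption that the configuration has already been transformed so that the first m coordinates form the structure subspace, is to take the first m rows of K and D as PU coordinates and connections; the matrices continue the hypothetical example above.

# Sketch of extracting the structure configuration (K_S, D_S, A): the first m
# coordinates of the (suitably transformed) vector-nodes and vector-edges are
# interpreted as PU coordinates and inter-PU connections, the remaining n - m
# coordinates as time components. The matrices are the assumed example above.
import numpy as np

m = 1                                # assumed dimension of the structure subspace
K = np.array([[1, 2],
              [0, 1]])               # transformed vector-nodes (assumed)
D = np.array([[1, -1],
              [1, -1]])              # corresponding vector-edges

K_S = K[:m, :]                       # coordinates of the PUs in Z^m
D_S = D[:m, :]                       # connections between the PUs
K_T = K[m:, :]                       # time components of the nodes (events)

print("PU coordinates K_S =", K_S)
print("connections    D_S =", D_S)
print("time components K_T =", K_T)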
