Tensor approach to the application specific processor design


Proc. 10th Int. Conf. “The Experience of Designing and Application of CAD Systems in Microelectronics”, CADSM’2009, 24-28 Feb. 2009. IEEE Library, 2009. P. 146-149.

Tensor Approach to the Application Specific Processor Design

Oleg Maslennikow − Polytechnica Koszalinska, Poland, E-mail: oleg@moskit.ie.tu.koszalin.pl
Anatolij Sergiyenko, Yurij Vinogradow − National Technical University of Ukraine, E-mail: aser@comsys.ntu-kpi.kiev.ua

Abstract – A method is proposed for mapping an algorithm represented by a loop nest into an application specific structure. The method consists in translating the loop nest into a tensor equation, which represents a set of structural solutions. An optimized solution is found by solving this equation in integers. The proposed constraints on parts of the tensors help to derive a pipelined structure and simplify the mapping process. The method is illustrated by the synthesis of an IIR filter structure. It is intended for mapping DSP algorithms into FPGAs.
Keywords – Algorithm mapping, SDF, DSP, FPGA.

I. INTRODUCTION

At present, field programmable gate arrays (FPGAs) are widely used to implement high-speed DSP algorithms. But their programming usually consists in drawing a functional network that contains ready-to-use modules. The design of such modules, or of programming tools like module generators, remains a very complex task. This task is now formalized for acyclic algorithms like FIR filters, and is still under investigation for cyclic algorithms like IIR filters [1,2]. Therefore, programming tools that help to map DSP algorithms into networks adapted to the FPGA architecture are in demand.

Consider a DSP algorithm which can be represented by a set of recurrent equations or a loop nest. The kernel of a regular loop nest contains one or several assignments like:

St_i: a[I] = f(a[I+D₁ ], b[I+D₂ ],…),

where I is an index vector, which represents a point in the iteration space, and D_j is the index increment vector of the j-th variable, which represents the data dependence between the I-th and (I+D_j)-th iterations. An irregular loop nest can be transformed into a regular one by the data pipelining or global transfer removing techniques [3,4].

If the loop kernel has a set of independent operators St_i, then this set can be represented as a single vector I in the iteration space. Well-known systolic array synthesis methods use the mapping of such algorithms [3-5]. These methods are based on an affine transformation, with the matrix P, of the iteration space Zⁿ, I ∈ Zⁿ, into the subspace of structures Z^m and the subspace of events Z^(n−m), so that the operator St_i belonging to the iteration I is calculated in the processor unit (PU) with the coordinates K_s = P_s I in the clock cycle marked by K_t = P_t I, K_s ∈ Z^m, K_t ∈ Z^(n−m). In the usual situation when n − m = 1, the conditions of such a mapping are D_j P_t ≥ 0 (monotony condition) and det P ≠ 0 (injection condition).
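The two mapping conditions can be checked mechanically. The following is a minimal sketch under stated assumptions: the function names, the example matrices, and the convention that the last row of P is P_t and that dependence vectors point from producer to consumer are illustrative choices, not taken from the methods [3-5].

```python
def det(m):
    """Integer determinant by Laplace expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for c, v in enumerate(m[0]):
        minor = [row[:c] + row[c + 1:] for row in m[1:]]
        total += (-1) ** c * v * det(minor)
    return total


def is_valid_mapping(P, deps):
    """Check the monotony and injection conditions for the case n - m = 1.

    P    -- n x n integer matrix; rows 0..n-2 form P_s, the last row is P_t
            (assumed layout).
    deps -- list of dependence vectors D_j of length n.
    """
    p_t = P[-1]
    monotone = all(sum(a * b for a, b in zip(d, p_t)) >= 0 for d in deps)
    injective = det(P) != 0
    return monotone and injective


# Hypothetical 2-D example: PU coordinate = i, clock cycle = i + j.
P_good = [[1, 0], [1, 1]]
print(is_valid_mapping(P_good, [(1, 0), (0, 1)]))
```

A matrix with linearly dependent rows, or a schedule row P_t that gives a negative product with some D_j, is rejected by the same function.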

The systolic array synthesis methods have a set of limitations which prevent their direct use in DSP system design. They assume that the operators St_i of a single iteration are executed simultaneously and that their duration is no more than a single clock cycle. Therefore, complex operators cannot be implemented in the pipelined mode.

In this paper a new method of DSP application specific processor design is proposed, based on mapping algorithms which are given by regular loop nests. The resulting processors have pipelined ALUs and fit modern FPGAs effectively.

II. INITIAL DATA FOR THE SYNTHESIS

Consider a single loop algorithm:

for i = 1, U_i do
   (y₁ (i), ..., y_p (i)) = f(x₁ (i − d_i1 ), ..., y_q (i − d_iq ))
end.

(1)

Here the function f is calculated using U_j binary assignments St_j. Therefore, the algorithm (1) can be represented as follows:

for i = 1, U_i do
    {statement St₁ }
     . . .
       St_j : y[i,j] = φ_j,k (y[i-d_i1, j], y[i-d_i2, j])
     . . .
    {statement St_{U_j}}
end,

(2)

where φ_j,k(x,y) is the operator of the k-th type, which is applied to the operands x, y. This loop can be transformed into the following three-level loop nest, such that in the (i, j, k)-th iteration only the j-th operator of the k-th type is executed.

for i = 1,U_i do
  for j = 1,U_j do
    for k = 1,U_k do
      if (j,k) ∈ Φ then y[i,j] = φ_j,k (y[i − d_i1, j], y[i − d_i2, j])
    end
  end
end,

(3)

where Φ is the set of allowed pairs (j,k), which give the type and the implementation order of the operators in the algorithm (3).

As a result, the loop (1), which contains a set of different operators, can be transformed into the nest of three loops (3), and its calculations are mapped into the three-dimensional iteration space K³ = {1 ≤ i ≤ U_i, 1 ≤ j ≤ U_j, 1 ≤ k ≤ U_k} ⊂ Z³. Loop nests of higher dimensions can be derived analogously. Each operator is represented in the space K³ by a vector K_i ∈ K³. The data dependence between the operators represented by K_i and K_l is equivalent to the dependence vector D_j = K_i − K_l.
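The transformation into the three-level nest can be sketched as a generator that enumerates the occupied points of K³. This is only an illustration: the function name `iteration_points` and the example set Φ (two operators of type 1 followed by two operators of type 2) are assumptions.

```python
def iteration_points(U_i, phi):
    """Yield the points (i, j, k) of the iteration space in which an operator runs.

    U_i -- number of iterations of the outer loop.
    phi -- set of allowed pairs (j, k): operator number j has type k.
    """
    U_j = max(j for j, _ in phi)
    U_k = max(k for _, k in phi)
    for i in range(1, U_i + 1):
        for j in range(1, U_j + 1):
            for k in range(1, U_k + 1):
                if (j, k) in phi:      # only the j-th operator of the k-th type
                    yield (i, j, k)


def dependence_vector(K_i, K_l):
    """Dependence vector D_j = K_i - K_l between two operator points."""
    return tuple(a - b for a, b in zip(K_i, K_l))


# Hypothetical kernel: operators 1, 2 of type 1 and operators 3, 4 of type 2.
phi = {(1, 1), (2, 1), (3, 2), (4, 2)}
points = list(iteration_points(2, phi))
print(points)
print(dependence_vector(points[2], points[0]))
```

Each iteration of the outer loop contributes exactly one point per pair in Φ, so the number of points is U_i times the number of operators in the kernel.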

In the resulting structure each processing unit (PU) is specialized to implement a single function φ_k. The base set of such PUs for DSP applications contains the simplest PUs like an adder, a multiplier, or a ROM. Their local memory is a FIFO buffer or a single result register.

III. MAPPING THE REGULAR LOOP NEST INTO THE PROCESSOR STRUCTURE

In the methods [3-5] the algorithm graph G_A is represented in the n-dimensional space Zⁿ. Hence G_A is a regular lattice graph. It is represented in its compact form by a set of different vectors-edges D_j of data dependences. If the loop nest contains a set of operators like (2), then the compact form is the synchronous dataflow graph (SDF) or the scalable SDF [6]. This directed graph has N operators-nodes K_i, which are connected by the respective dependence vectors-edges D_j. Consider the following algorithm:

for i = 1, N do
  for j = 1, M do
    St₁ : a[i,j] = b[i − 1, j − 1];
    St₂ : b[i,j] = a[i, j];
  end
end.

This algorithm is represented by the SDF graph shown in Fig. 1.

Vectors D₁ and D₂ represent the movement of the data a, b between the operators St₁ and St₂. They are labeled by the vectors of relative transfer delays (0, 0) and (1, 1). To represent the SDF graph G_AR in the n-dimensional space, both the matrix D of the vectors D_j of data dependences and the matrix K of the vectors K_i are needed. Here the vector K_i is equal to the coordinates of the i-th operator node. The incidence matrix A of the graph G_AR is needed to express the linear dependence between the matrices K and D:

D = KA; (5)
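As an illustration of equation (5), the sketch below multiplies a node matrix K by an incidence matrix A for a graph with two nodes and two edges forming a cycle. The node coordinates are hypothetical, and the sign convention (−1 at the source node and +1 at the destination node of each edge, so that D_j is destination minus source) is an assumption.

```python
def matmul(X, Y):
    """Product of two integer matrices given as lists of rows."""
    return [[sum(X[r][t] * Y[t][c] for t in range(len(Y)))
             for c in range(len(Y[0]))]
            for r in range(len(X))]


# Columns of K are the node vectors: K_1 = (0, 0), K_2 = (1, 0) (assumed values).
K = [[0, 1],
     [0, 0]]

# Rows of A are nodes, columns are edges.
# Edge 1 goes from node 1 to node 2, edge 2 goes from node 2 back to node 1.
A = [[-1,  1],
     [ 1, -1]]

D = matmul(K, A)   # columns of D are the edge vectors D_1, D_2
print(D)           # [[1, -1], [0, 0]], i.e. D_1 = (1, 0), D_2 = (-1, 0)
```

Changing only K while keeping A fixed gives another placement of the same graph, which is the sense in which A stays invariant across configurations.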

The set of matrices K, D and A forms the so-called algorithm configuration C_A. By their nature, the matrices A, K, D are tensors of both the algorithm and the resulting structure, and the equation (5) is a tensor equation. In [7] it is shown that the properties of many technical objects can be described by a tensor equation. According to the tensor theory, a complex technical system can be described by its tensor. The tensor is a generalized matrix, which can be changed by the allowed transformations. Therefore, a set of different implementations of a system can be described by a tensor, and one implementation can be transformed into another one by some transformation of its tensor. The system synthesis consists in building the tensor equation and in a directed search for the tensor transformation which optimizes the effectiveness criterion. In this paper it is shown how to find optimized structural solutions by algorithm mapping using the principles of the tensor theory.

The following definitions and relations hold for the configuration C_A. The configuration C_A is correct if K_i ≠ K_j; i,j = 1,…,N, i ≠ j, i.e. if all the vectors-nodes are placed separately in the space Zⁿ.

An inverse linear dependence between the configuration matrices also holds, i.e.

K = D₀A₀^-1, (6)

where A₀ is the incidence matrix of the maximum spanning tree of the graph G_AR, D₀ is the matrix of the vectors-edges of this tree, including the base vector which connects the graph node with the coordinate system.

The sum of the vectors-edges D_j belonging to a graph cycle must be equal to zero, i.e. for the i-th cycle

Σ_j b_i,j D_j = 0, (7)

where b_ij is the element of the i-th row of the cyclomatic matrix of the graph G_AR.
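Condition (7) is easy to test for a candidate set of edge vectors. The sketch below is illustrative: the function names and the three-edge example cycle are assumptions.

```python
def cycle_sums(D, B):
    """Return the sum of b_ij * D_j over j for every row i of the cyclomatic matrix B.

    D -- list of edge vectors D_j (tuples of equal length).
    B -- cyclomatic matrix as a list of rows with entries -1, 0 or 1.
    """
    n = len(D[0])
    return [tuple(sum(b * d[c] for b, d in zip(row, D)) for c in range(n))
            for row in B]


def satisfies_cycle_condition(D, B):
    """True if every cycle of the graph closes, i.e. condition (7) holds."""
    return all(all(v == 0 for v in s) for s in cycle_sums(D, B))


# Hypothetical graph with one cycle that passes through all three edges.
B = [[1, 1, 1]]
print(satisfies_cycle_condition([(1, 0), (0, 1), (-1, -1)], B))  # True
print(satisfies_cycle_condition([(1, 0), (0, 1), (-1, 0)], B))   # False
```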

Configurations C_A1 = (K₁, D₁, A₁) and C_A2 = (K₂, D₂, A₂) are equivalent if they are correct and represent the same algorithm graph, i.e. A₁ = A₂. A correct configuration C_A1 is equivalent to the configuration C_A2 iff A₁ = A₂ and K₂ = F(K₁), where F is an injective function. For example, the following transformations give equivalent configurations: transposition of a vector K_i in the space Zⁿ, transposition of rows or columns of the matrix K₁, multiplication of the matrix K₁ by a non-singular matrix P.

According to the tensor theory, any tensor description of an object must have an invariant tensor, which is unaffected by any tensor transformation. Here the matrix A and its submatrix A₀ represent the invariant tensors. The matrix K encodes some variant of the synthesized structure. The structure optimization consists in generating equivalent configurations, which differ in their matrices K, and in selecting the best one according to some criterion.

The processor structure graph G_S is represented by its structure configuration C_S = (K_S, D_S, A), where K_S is the matrix of vectors-nodes K_Si ∈ Z^m, which give the PU coordinates, and D_S is the matrix of vectors-edges D_Sj ∈ Z^m, which represent the connections between PUs, m < n.

The event configuration C_T = (K_T, D_T, A) consists of the matrix K_T of the vectors K_Ti ∈ Z^(n−m), the matrix D_T of the vectors D_Tj, and the matrix A. Here the vectors K_Ti represent the events of operator execution. In a correct configuration C_T the vector D_Tj = K_Tl − K_Ti means that the operator represented by K_Ti must precede the operator represented by K_Tl.

The timing function R(K_Ti) = t_i maps the space of events Z^(n−m) onto the time axis and gives the time of the operator execution.

The configuration C_T is correct, in other words the precedence condition holds, if for any pair of vectors K_Ti and K_Tl, where K_Ti precedes K_Tl, the inequality R(K_Tl) > R(K_Ti) holds.

If the function R is linear and monotonic, then the configuration C_T is correct iff D_Tj ≥ 0, j = 1,…,M, where D_Tj are the vectors-edges of the SDF which are not marked by relative transfer delays (or are marked by zero ones).

The function R(D_Tj) gives the delay between computing a variable in one PU and its entering another PU, i.e. the upper bound of the FIFO buffer length.

Consider the mapping of the algorithm (2) into a structure which calculates the loop kernel in the pipelined mode with the period of L clock cycles. When this algorithm is represented in the three-dimensional index space, the vectors are K = (j, k, i)^T, where j, k, i mean the operator number, operator type, and cycle number respectively. When the additional dimension q of the clock cycle is added to the algorithm configuration, K = (j, k, i, q)^T. The vector-edge which represents an interiteration dependence is equal to D_b = (0, 0, −p, 0), where p is the distance between iterations.

Algorithm configuration C_A is equal to the composition of structure configuration C_S and event configuration C_T, and if K_l = (j, k, i, q)^T, then K_Sl = (j, k)^T and K_Tl = (i, q)^T. In the vector K_Sl = (j,k)^T , the coordinates j,k are equal to the PU number, where the l-th operator of the k-th type is implemented.

First, the space component of the mapping is searched for. Forming the matrix K_S is a combinatorial task. In this process the M_K operators of the k-th type are distributed among no fewer than ⌈M_K/L⌉ PUs of the k-th type. In the matrix K_S, M_S groups of equal columns are formed, each of which contains up to L columns, where M_S is the number of PUs in the resulting structure. The j-th PU has the maximum load if the number of columns with the j-th coordinate is equal to L. Then the matrix D_S is derived from the equation D_S = K_S A.
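One simple way to form a matrix K_S with the minimum number of PUs is to fill each PU of a given type with L operators before opening the next one. This greedy strategy is only one assumed option among the combinatorial choices mentioned above, and the function name `allocate` and the example operator list are hypothetical.

```python
from collections import defaultdict


def allocate(op_types, L):
    """Return the columns K_Sl = (j, k) of the matrix K_S.

    op_types -- type k of every operator, in operator order.
    L        -- iteration period in clock cycles (operators per PU).
    The PU index j is counted separately within each type k.
    """
    placed = defaultdict(int)          # operators of each type placed so far
    K_S = []
    for k in op_types:
        j = placed[k] // L             # open a new PU after every L operators
        K_S.append((j, k))
        placed[k] += 1
    return K_S


# Hypothetical kernel: two multiplications (k = 0), two additions (k = 1),
# two equality operators (k = 2), period L = 2.
K_S = allocate([0, 0, 1, 1, 2, 2], L=2)
print(K_S)                 # [(0, 0), (0, 0), (0, 1), (0, 1), (0, 2), (0, 2)]
print(len(set(K_S)))       # 3 PUs, each loaded in both clock cycles
```

With this strategy every PU except possibly the last one of each type holds exactly L columns, so it reaches the maximum load described in the text.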

The time component of the mapping represented by the matrices K_T and D_T is searched with respect to the conditions of the correctness of the algorithm configuration and event configuration, and equation (7). Besides, the algorithm is implemented correctly with the iteration period L iff

∀K_Ti ∈ K_T (K_Ti = (i, q)^T, i ≥ 0, q ∈ {0, 1,…,L−1}).
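The condition above can be checked for a candidate matrix K_T as follows. The sketch also tests an additional resource rule, stated here as an assumption: two operators placed in the same PU must not share the same clock cycle q. The function name and the example values are hypothetical.

```python
def check_events(K_S, K_T, L):
    """Check a candidate event matrix against the period-L condition.

    K_S -- list of PU coordinates (j, k), one per operator.
    K_T -- list of event vectors (i, q), one per operator.
    L   -- iteration period in clock cycles.
    """
    # Every event vector must have i >= 0 and q in {0, ..., L-1}.
    in_range = all(i >= 0 and 0 <= q < L for i, q in K_T)
    # Assumed resource rule: a PU executes at most one operator per clock cycle q.
    slots = [(pu, q) for pu, (_, q) in zip(K_S, K_T)]
    no_conflict = len(slots) == len(set(slots))
    return in_range and no_conflict


K_S = [(0, 0), (0, 0), (0, 1), (0, 1)]          # two operators per PU
print(check_events(K_S, [(0, 0), (0, 1), (0, 1), (1, 0)], L=2))  # True
print(check_events(K_S, [(0, 0), (1, 0), (0, 1), (1, 0)], L=2))  # False: PU clash
```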

The strategies for searching the space and time components are illustrated by the following example of the structure synthesis.

IV. EXAMPLE OF THE PROCESSOR SYNTHESIS

Consider the synthesis of the second order IIR filter structure, which calculates the equation:

y[i] = x[i] + a ⋅ y[i−2] + b ⋅ y[i−1].

This equation is calculated by the following loop:

for i = 1, N do
  St₁: y₁ [i] = a*y[i−2];
  St₂: y₂ [i] = b*y[i−1];
  St₃: y₃ [i] = x[i] + y₁ [i];
  St₄: y[i] = y₂ [i] + y₃ [i];
end.
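The loop above translates directly into executable form. The sketch below assumes zero initial conditions (y[i] = 0 for i < 0) and zero-based indexing, neither of which is stated in the text; the function name is hypothetical.

```python
def iir_loop(x, a, b):
    """Second order IIR filter written as the four statements St1..St4."""
    y = []
    for i in range(len(x)):
        y_im2 = y[i - 2] if i >= 2 else 0.0   # assumed zero initial state
        y_im1 = y[i - 1] if i >= 1 else 0.0
        y1 = a * y_im2          # St1
        y2 = b * y_im1          # St2
        y3 = x[i] + y1          # St3
        y.append(y2 + y3)       # St4
    return y


# Impulse response for the hypothetical coefficients a = 0.5, b = 0.25.
print(iir_loop([1.0, 0.0, 0.0, 0.0], a=0.5, b=0.25))
# [1.0, 0.25, 0.5625, 0.265625]
```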

The SDF graph of this algorithm is shown in Fig. 2.

Each operator takes at least a single clock cycle. The loaded edges mean delays of the variable y[i] by one and two cycles, and cannot express the delay of the operator St₄. Therefore, additional nodes are inserted into these edges. The modified SDF graph is shown in Fig. 3.

Figure 2. Initial SDF graph of the IIR filter

Figure 3. Extended SDF graph of the IIR filter

This graph represents the following algorithm

for i = 1, N do
   St₁: y₁[i] = a*y₅[i−2];
   St₂: y₂[i] = b*y₆[i−1];
   St₃: y₃[i] = x[i] + y₁[i];
   St₄: y[i] = y₂[i] + y₃[i];
   St₅: y₅[i] = y[i];
   St₆: y₆[i] = y[i];
end.
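That the extended algorithm computes the same filter can be checked numerically. The sketch assumes that St₅ and St₆ only forward y[i], so that the delays of two and one iterations are carried by the reads y₅[i−2] and y₆[i−1], and it assumes zero initial conditions. Function names are hypothetical.

```python
def iir_direct(x, a, b):
    """Reference: y[i] = x[i] + a*y[i-2] + b*y[i-1], zero initial state assumed."""
    y = []
    for i in range(len(x)):
        y1 = a * (y[i - 2] if i >= 2 else 0.0)
        y2 = b * (y[i - 1] if i >= 1 else 0.0)
        y.append(y2 + (x[i] + y1))
    return y


def iir_extended(x, a, b):
    """Six-statement form with the forwarding nodes St5 and St6."""
    n = len(x)
    y, y5, y6 = [0.0] * n, [0.0] * n, [0.0] * n

    def past(v, i):
        return v[i] if i >= 0 else 0.0   # values before the first iteration

    for i in range(n):
        y1 = a * past(y5, i - 2)   # St1
        y2 = b * past(y6, i - 1)   # St2
        y3 = x[i] + y1             # St3
        y[i] = y2 + y3             # St4
        y5[i] = y[i]               # St5 (assumed pure forwarding)
        y6[i] = y[i]               # St6
    return y


x = [1.0, 0.0, 0.0, 0.0, 2.0]
print(iir_extended(x, 0.5, 0.25) == iir_direct(x, 0.5, 0.25))
```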

The calculation period is L = 2, which means that a single couple of adder and multiplier can calculate it.

By the search of the space component the permissible coordinates K_si are set:

Here the coordinates k = 0, 1, 2 mean the multiplication, addition, and equality operators. The matrix D_S is derived from the equation D_S = K_S A.

When the time component of the mapping is searched for, the known coordinates are set in the weighted vectors-edges D_T6 = (−2 0)^T and D_T7 = (−1 0)^T. The timing function is selected as R = (L 1) = (2 1). To minimize the number of registers, the vectors D_Tj which leave the nodes 1,…,4 must have coordinates providing R ⋅ D_Tj = 1 or 2, i.e. (0 1)^T, (1 −1)^T, or (1 0)^T, which satisfy the monotony condition.
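The three admissible vectors can be reproduced by a small enumeration. The sketch assumes that both coordinates of D_Tj are searched in the window −1…1 (the q coordinate of a difference of two event vectors cannot leave this window when L = 2); the function name is hypothetical.

```python
from itertools import product


def candidate_edges(R, allowed=(1, 2), span=1):
    """List the integer vectors D = (i, q) with R . D in the allowed set of delays."""
    return [d for d in product(range(-span, span + 1), repeat=2)
            if R[0] * d[0] + R[1] * d[1] in allowed]


R = (2, 1)                      # timing function R = (L 1) for L = 2
print(candidate_edges(R))       # [(0, 1), (1, -1), (1, 0)]
```

The first two vectors give a delay of one clock cycle and the third gives a delay of two, which matches the values of R ⋅ D_Tj named in the text.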

To satisfy the injection condition, the vectors K_Ti with equal coordinate q must be different, for example, when K_T1 = (X 0)^T or (X 1)^T, then K_T2 = (X 1)^T or (X 0)^T respectively, where X is an unknown value. The coordinates q of the vectors D_Tj are derived from the set of equations:

D_T = K_T A;
D_T1 + D_T3 + D_T4 + D_T6 = 0;
D_T2 + D_T5 + D_T7 = 0.

Due to these conditions the following solution is found:

Figure 4. Algorithm configuration of the IIR filter

Figure 5. Structure configuration of the IIR filter

Fig. 4 illustrates the derived algorithm configuration, and Fig. 5 shows the respective structure configuration. This solution is distinguished by the maximum hardware load of the PUs and by operation in the pipelined mode. It is the only structural solution of the second order IIR filter in which the minimum clock cycle is equal to a single multiplier delay and the input data arrive with the period of two cycles.

V. CONCLUSION

A method of application specific processor design is proposed which is based on the tensor theory of system design. The method has been proven and widely used in the successful development of a set of DSP applications configured in FPGAs, for example, those published in [9, 10].

REFERENCES

[1] J.Isoaho, J.Pasanen, O.Vainio “DSP System Integration and Prototyping With FPGAs” J. of VLSI Signal Processing. V 6, 1993, pp. 155-172.

[2] “System Generator for DSP. Getting Started Guide” August, 2007, 85p. See http://www.xilinx.com

[3] S.V.Rajopadhye “Synthesizing systolic arrays with control signals from recurrence equations” Distributed Computing. V3, 1989, pp. 88-105.

[4] J.Fortes, D.Moldovan “Data broadcasting in linearly scheduled array processors” Proc. 11th Annual Symp. on Comp. Arch., 1984, pp. 224-231.

[5] S.Y.Kung “VLSI Array Processors” Englewood Cliffs, N.J.: Prentice Hall, 1988.

[6] S.Ritz, M.Pankert, and H.Meyr “Optimum vectorization of scalable synchronous dataflow graphs” Proc. Int. Conf. on Application Specific Array Processors. October. 1993.

[7] G.Kron “Tensor analysis of networks” MacDonald, London, 1965. 635 p.

[8] A. Sergyienko, O. Maslennikov. “Implementation of Givens QR Decomposition in FPGA” Lecture Notes in Computer Science, Springer, 2002, Vol. 2328, pp. 453-459.

[9] A. Sergyienko, V.Simoneko “DSP algorithm mapping into FPGAs” Proc. Int. Conf. Simulation-2006. Kiev. Energy Problem Modeling Institute of NAS of Ukraine. 2006. pp.189-193.

[10] O.Maslennikov, Ju. Shevtshenko, A. Sergiyenko “Configurable Microprocessor Array for DSP Applications” Lecture Notes in Computer Science. V. 3019. 2004. pp. 36-41.
