
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-II: EXPRESS BRIEFS

New Systolic Algorithm and Array Architecture for Prime-Length Discrete Sine Transform

Pramod K. Meher, Senior Member, IEEE, and M. N. S. Swamy, Fellow, IEEE

Abstract: Using a simple input regeneration approach and index transformation techniques, a new formulation is presented in this paper for computing an $N$-point prime-length discrete sine transform (DST) through two pairs of $[(N-1)/4]$-point cyclic convolutions, where $[(N-1)/4]$ is an odd number. The cyclic convolution-based algorithm is then used to obtain a simple, regular and locally connected linear systolic array for concurrent pipelined implementation of the DST. It is shown that the proposed systolic structure involves significantly less area-time complexity than the existing structures.

Index Terms: Discrete sine transform (DST), discrete cosine transform (DCT), systolic array, very-large-scale integration (VLSI).

I. INTRODUCTION

The discrete sine transform (DST) and the discrete cosine transform (DCT) have key functions in signal and image processing systems, not only for their near-optimal transform coding behaviour but also for several other applications, e.g., block filtering, transform-domain adaptive filtering, digital signal interpolation, adaptive beamforming and image resizing [1]-[4]. For transform coding applications the usual block size is 8, but for most other applications it is quite useful to have the DST and the DCT of prime transform lengths and composite transform lengths (consisting of two or more relatively prime factors). While there are only limited choices of power-of-two transform lengths, the prime-factor approach provides closely spaced suitable choices of transform lengths [5], [6].
Moreover, the prime-factor approach not only offers scalability in hardware, time and transform size, but also involves significantly less area-time complexity than the direct implementation of long-length transforms. Various algorithms and architectures have therefore been suggested in the literature for efficient computation of prime-length DCT and DST, and for combining them efficiently to compute transforms of composite lengths [5], [6]. Several attempts have been made in recent years to implement the prime-length DST and DCT efficiently in systolic hardware through cyclic convolutional formulations [7]-[11], owing to their remarkable advantage over other approaches, particularly for efficient input/output and data transfer operations.

Apart from this, the convolutional representation of the DST is found to be more suitable for memory-based and adder-based systolic realization. Attempts have therefore been made to implement the $N$-point DST more efficiently through a pair of $[(N-1)/2]$-point cyclic convolutions [10] or convolution-like computations [9].

Manuscript submitted on July 21, 2006; revised October 6, 2006.
P. K. Meher is with the School of Computer Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore, 639798. Email: aspkmeher@ntu.edu.sg, URL: http://www.ntu.edu.sg/home/aspkmeher/.
M. N. S. Swamy is with the Department of Electrical & Computer Engineering, Concordia University, 1455 de Maisonneuve Blvd. West, Montreal, Quebec, Canada H3G 1M8. Email: swamy@ece.concordia.ca.
Copyright (c) 2006 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.
Using a new input regeneration scheme and simple index transformation techniques, we formulate in this paper a low-complexity concurrent algorithm that converts an $N$-point prime-length DST into a pair of $[(N-1)/2]$-point exact cyclic convolutions; each of those $[(N-1)/2]$-point cyclic convolutions is further reduced to a pair of $[(N-1)/4]$-point cyclic convolutions when $[(N-1)/4]$ is odd. The proposed convolution-based algorithm is used to derive a simple, regular and area-time efficient linear systolic array for the prime-length DST. The low-complexity convolutional formulation is derived in Section II and illustrated in Section III. The proposed systolic architecture is derived in Section IV. Hardware and time complexities of the proposed design are estimated and compared with those of the existing structures in Section V. Conclusions are presented in Section VI.

II. CONVOLUTIONAL FORMULATION OF THE DST

The DST of a sequence $\{y(n),\ 0 \le n \le N-1\}$ can be defined as

\[ X(k) = \sum_{n=0}^{N-1} y(n) \sin\left[\frac{\pi(2n+1)k}{2N}\right], \quad 1 \le k \le N. \tag{1} \]

Using the properties of sine and cosine functions, for any positive integer $n$, one can find that

\[ \sin\left[\frac{\pi(2n+1)k}{2N}\right] = \left[1 + 2\sum_{i=1}^{n} \cos\left(\frac{\pi k i}{N}\right)\right] \sin\left(\frac{\pi k}{2N}\right). \tag{2} \]

Substituting (2) in (1), the DST can also be expressed as

\[ X(N) = x(0), \tag{3a} \]

and

\[ X(k) = \left[2S(k) + x(0)\right] \sin\left(\frac{\beta k}{2}\right), \tag{3b} \]

where

\[ S(k) = \sum_{n=1}^{N-1} x(n) \cos(\beta k n), \tag{3c} \]

for $k = 1, 2, \ldots, N-1$ and $\beta = \pi/N$. The input sequence $\{x(n)\}$ is generated by successive accumulation, given by

\[ x(N-1) = y(N-1), \tag{3d} \]

and

\[ x(N-n) = y(N-n) + x(N-n+1), \tag{3e} \]

for $n = 2, \ldots, N$.

When $N$ is any odd number, the even- and odd-indexed components of the intermediate result $S(k)$ can be separated out from (3c), and the even- and odd-indexed terms in each of those components can be combined by using the symmetry of cosine functions to obtain

\[ S(2k) = P(k) = \sum_{n=1}^{(N-1)/2} p(n) \cos\left[2\beta\,\phi(k,n)\right], \tag{4a} \]

\[ S(N-2k) = Q(k) = \sum_{n=1}^{(N-1)/2} q(n) \cos\left[2\beta\,\phi(k,n)\right], \tag{4b} \]

where

\[ p(n) = x(2n) + x(N-2n), \tag{4c} \]

\[ q(n) = x(2n) - x(N-2n), \tag{4d} \]

for $k, n = 1, 2, \ldots, (N-1)/2$, and $\phi(k,n)$ in the argument of the cosine function is given by

\[ \phi(k,n) = \begin{cases} (2kn)_N, & \text{if } (2kn)_N \le M, \\ N - (2kn)_N, & \text{if } (2kn)_N > M. \end{cases} \tag{5} \]

The symbol $(\cdot)_N$ in (5) denotes the modulo-$N$ operation, and $M = (N-1)/2$.

When the transform length $N$ is prime, each of the two sequences $\{P(k)\}$ and $\{Q(k)\}$, for $1 \le k \le M$, given by (4) can be converted into an $[(N-1)/2]$-point circular convolution by suitable permutations, achieved by mapping the indices $k$ to $l$ and $n$ to $m$ according to the following equations:

\[ n = \begin{cases} (\eta^{-m})_N, & \text{if } (\eta^{-m})_N \le M, \\ N - (\eta^{-m})_N, & \text{otherwise,} \end{cases} \tag{6} \]

and

\[ k = \begin{cases} (\eta^{l})_N, & \text{if } (\eta^{l})_N \le M, \\ N - (\eta^{l})_N, & \text{otherwise,} \end{cases} \tag{7} \]

for $0 \le l, m \le M-1$, where $\eta$ is an $(N-1)$-th primitive root of unity modulo $N$, such that $\eta^{N-1} \bmod N = 1$ and $\eta^{j} \bmod N \ne 1$ for $0 < j < (N-1)$. Using the mapping given by (6) and (7), each of the sequences $\{P(k)\}$ and $\{Q(k)\}$ of (4) may thus be expressed as an $M$-point cyclic convolution of the form

\[ y(k) = \sum_{n=0}^{M-1} h(n)\, r\big((k-n)_M\big). \tag{8} \]

The input sequence $\{r(n)\}$ in (8) corresponds to one of the two sequences $\{p(n)\}$ and $\{q(n)\}$, $\{h(n)\}$ corresponds to the fixed coefficients $\{\cos(2\beta\,\phi(k,n))\}$, and the convolved output sequence corresponds to $\{P(k)\}$ and $\{Q(k)\}$ of (4a) and (4b), respectively.

When $M = 2K$ and $K$ is any odd number, each of the sequences $\{r(n)\}$, $\{h(n)\}$ and $\{y(n)\}$ can be converted into a 2-dimensional array of size $K \times 2$ by mapping the index $n$ to the pair $(n_1, n_2)$ using the Chinese remainder theorem (CRT) according to the relations:

\[ n_1 = n \bmod K, \quad 0 \le n_1 < K, \tag{9a} \]

\[ n_2 = n \bmod 2, \quad n_2 = 0 \text{ and } 1, \tag{9b} \]

where the inverse mapping from $(n_1, n_2)$ to $n$ is performed by the relation:

\[ n = (n_1 s_1 + n_2 s_2) \bmod M, \quad 0 \le n < M, \tag{10} \]

for $s_1 \bmod K = 1$, $s_2 \bmod 2 = 1$, $s_1 \bmod 2 = 0$ and $s_2 \bmod K = 0$. Using the mapping of (9), the cyclic convolution of (8) may be converted into a two-dimensional form [12]:

\[ y(k_1, k_2) = \sum_{n_1=0}^{K-1} \sum_{n_2=0}^{1} h(n_1, n_2)\, r(k_1 - n_1,\, k_2 - n_2), \tag{11} \]

where the indices $(k_1 - n_1)$ and $(k_2 - n_2)$ of $r(n_1, n_2)$ are understood to be taken mod $K$ and mod 2, respectively. The 2-point convolutions in (11) can then be expanded to:

\[ y(k_1, 0) = \sum_{n_1=0}^{K-1} \left[ h(n_1, 0)\, r(k_1 - n_1, 0) + h(n_1, 1)\, r(k_1 - n_1, 1) \right], \tag{12a} \]

\[ y(k_1, 1) = \sum_{n_1=0}^{K-1} \left[ h(n_1, 0)\, r(k_1 - n_1, 1) + h(n_1, 1)\, r(k_1 - n_1, 0) \right]. \tag{12b} \]

From (12), one can obtain

\[ y(k_1, 0) = d(k_1, 0) + d(k_1, 1), \tag{13a} \]

\[ y(k_1, 1) = d(k_1, 0) - d(k_1, 1), \tag{13b} \]

where

\[ d(k_1, 0) = \sum_{n_1=0}^{K-1} a(n_1, 0)\, b\big((k_1 - n_1)_K, 0\big), \tag{13c} \]

\[ d(k_1, 1) = \sum_{n_1=0}^{K-1} a(n_1, 1)\, b\big((k_1 - n_1)_K, 1\big), \tag{13d} \]

\[ a(n_1, 0) = \left[h(n_1, 0) + h(n_1, 1)\right]/2, \tag{13e} \]

\[ a(n_1, 1) = \left[h(n_1, 0) - h(n_1, 1)\right]/2, \tag{13f} \]

\[ b(n_1, 0) = r(n_1, 0) + r(n_1, 1), \tag{13g} \]

\[ b(n_1, 1) = r(n_1, 0) - r(n_1, 1), \tag{13h} \]

for $0 \le k_1, n_1 < K$. The $M$-point cyclic convolution of (8) may, therefore, be computed from the pair of $M/2$-point cyclic convolutions of (13c) and (13d). The pair of $[(N-1)/2]$-point cyclic convolutions of (4a) and (4b) for the $N$-point DST may thus be computed from two pairs of $[(N-1)/4]$-point cyclic convolutions. It may be noted here that, as shown in [13], an $N$-point DCT can also be converted to a form similar to that of (3), and it can then be converted into two pairs of $[(N-1)/4]$-point cyclic convolutions in the same way as the DST discussed in this Section.
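The input regeneration step at the start of this Section can be checked numerically. The following Python sketch is only an illustration under stated assumptions, not the authors' implementation: the function names `dst_direct`, `regenerate_input` and `dst_via_S` are hypothetical, no scale factor is applied to (1), and the comparison is restricted to $k = 1, \ldots, N-1$, the range stated for (3b).

```python
import math

def dst_direct(y):
    """Direct evaluation of (1) for k = 1..N-1 (no scale factor assumed)."""
    N = len(y)
    return [
        sum(y[n] * math.sin(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N))
        for k in range(1, N)
    ]

def regenerate_input(y):
    """Successive accumulation of (3d)-(3e): x(N-1) = y(N-1), x(n) = y(n) + x(n+1)."""
    N = len(y)
    x = [0.0] * N
    x[N - 1] = y[N - 1]
    for n in range(N - 2, -1, -1):
        x[n] = y[n] + x[n + 1]
    return x

def dst_via_S(y):
    """Evaluate (3b), with S(k) taken from (3c), for k = 1..N-1."""
    N = len(y)
    beta = math.pi / N
    x = regenerate_input(y)
    result = []
    for k in range(1, N):
        S = sum(x[n] * math.cos(beta * k * n) for n in range(1, N))
        result.append((2.0 * S + x[0]) * math.sin(beta * k / 2.0))
    return result

# Arbitrary 13-point test input (hypothetical data, chosen only for the check).
y = [((7 * n + 3) % 11) - 5.0 for n in range(13)]
max_err = max(abs(a - b) for a, b in zip(dst_direct(y), dst_via_S(y)))
print("largest difference between (1) and (3b):", max_err)
```

The sketch evaluates $S(k)$ by a plain sum; in the formulation of this Section that sum is what gets split into $P(k)$ and $Q(k)$ and mapped to cyclic convolutions.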

III. EXAMPLE OF CONVERSION OF DST INTO CIRCULAR CONVOLUTION FORM

For simple illustration of the proposed formulation, we show here the conversion of a 13-point DST into two pairs of 3-point cyclic convolutions. For transform length $N = 13$, we can write (4a) and (4b) in a common matrix-vector form:

\[
\begin{bmatrix} U(1) \\ U(2) \\ U(3) \\ U(4) \\ U(5) \\ U(6) \end{bmatrix}
=
\begin{bmatrix}
c_2 & c_4 & c_6 & c_5 & c_3 & c_1 \\
c_4 & c_5 & c_1 & c_3 & c_6 & c_2 \\
c_6 & c_1 & c_5 & c_2 & c_4 & c_3 \\
c_5 & c_3 & c_2 & c_6 & c_1 & c_4 \\
c_3 & c_6 & c_4 & c_1 & c_2 & c_5 \\
c_1 & c_2 & c_3 & c_4 & c_5 & c_6
\end{bmatrix}
\begin{bmatrix} u(1) \\ u(2) \\ u(3) \\ u(4) \\ u(5) \\ u(6) \end{bmatrix}
\tag{14}
\]

where $c_i = \cos(2i\beta)$ and $\beta = \pi/13$ for $1 \le i \le 6$. Equation (14) represents (4a) when $U(i) = P(i)$ and $u(i) = p(i)$, while it represents (4b) when $U(i) = Q(i)$ and $u(i) = q(i)$, for $1 \le i \le 6$. To convert (14) into the desired circular convolution, we can find the primitive root of unity $\eta$ to be 2 for $N = 13$, and can map the indices according to (6) and (7) as shown in Table I.

TABLE I
MAPPING OF INDICES k AND n

l and m | 0 | 1 | 2 | 3 | 4 | 5
n       | 1 | 6 | 3 | 5 | 4 | 2
k       | 1 | 2 | 4 | 5 | 3 | 6

TABLE II
MAPPING OF INDEX n TO (n_1, n_2)

n          | 0      | 1      | 2      | 3      | 4      | 5
(n_1, n_2) | (0, 0) | (1, 1) | (2, 0) | (0, 1) | (1, 0) | (2, 1)

Using the mapping of Table I and the commutative property of cyclic convolution, (14) can be written as a 6-point cyclic convolution of the form:

\[
\begin{bmatrix} y(0) \\ y(1) \\ y(2) \\ y(3) \\ y(4) \\ y(5) \end{bmatrix}
=
\begin{bmatrix}
r(0) & r(5) & r(4) & r(3) & r(2) & r(1) \\
r(1) & r(0) & r(5) & r(4) & r(3) & r(2) \\
r(2) & r(1) & r(0) & r(5) & r(4) & r(3) \\
r(3) & r(2) & r(1) & r(0) & r(5) & r(4) \\
r(4) & r(3) & r(2) & r(1) & r(0) & r(5) \\
r(5) & r(4) & r(3) & r(2) & r(1) & r(0)
\end{bmatrix}
\begin{bmatrix} h(0) \\ h(1) \\ h(2) \\ h(3) \\ h(4) \\ h(5) \end{bmatrix}
\tag{15}
\]

where

$[y(0)\ y(1)\ y(2)\ y(3)\ y(4)\ y(5)]^T = [U(1)\ U(2)\ U(4)\ U(5)\ U(3)\ U(6)]^T$,
$[r(0)\ r(1)\ r(2)\ r(3)\ r(4)\ r(5)]^T = [u(1)\ u(6)\ u(3)\ u(5)\ u(4)\ u(2)]^T$, and
$[h(0)\ h(1)\ h(2)\ h(3)\ h(4)\ h(5)]^T = [c_2\ c_4\ c_5\ c_3\ c_6\ c_1]^T$.

The 6-point sequences $\{y(n)\}$, $\{h(n)\}$ and $\{r(n)\}$ of (15) can, respectively, be mapped into $3 \times 2$ matrices $[y(n_1, n_2)]$, $[h(n_1, n_2)]$ and $[r(n_1, n_2)]$ according to (9) as shown in Table II.
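The index mappings of Table I and the resulting cyclic convolution of (15) can be reproduced with a short Python sketch. This is a minimal illustration, assuming $N = 13$ and $\eta = 2$ as in the example; the helper names `fold`, `index_maps`, `direct_U` and `conv_U` are hypothetical, and the test values of $u(n)$ are arbitrary.

```python
import math

def fold(v, N):
    """Reduce v modulo N, then fold into 1..(N-1)/2 as done in (5)-(7)."""
    v %= N
    return v if v <= (N - 1) // 2 else N - v

def index_maps(N, eta):
    """Return the lists k(l) of (7) and n(m) of (6) for l, m = 0..M-1."""
    M = (N - 1) // 2
    eta_inv = pow(eta, N - 2, N)  # inverse of eta modulo the prime N
    k_of_l = [fold(pow(eta, l, N), N) for l in range(M)]
    n_of_m = [fold(pow(eta_inv, m, N), N) for m in range(M)]
    return k_of_l, n_of_m

def direct_U(u, N):
    """U(k) = sum_n u(n) cos(2*beta*phi(k, n)), as in (4a)/(4b); u maps n -> value."""
    M = (N - 1) // 2
    beta = math.pi / N
    return {
        k: sum(u[n] * math.cos(2 * beta * fold(2 * k * n, N)) for n in range(1, M + 1))
        for k in range(1, M + 1)
    }

def conv_U(u, N, eta):
    """Same outputs obtained through the M-point cyclic convolution of (8)."""
    M = (N - 1) // 2
    beta = math.pi / N
    k_of_l, n_of_m = index_maps(N, eta)
    r = [u[n] for n in n_of_m]                       # permuted input sequence
    h = [math.cos(2 * beta * fold(2 * pow(eta, j, N), N)) for j in range(M)]
    y = [sum(h[j] * r[(l - j) % M] for j in range(M)) for l in range(M)]
    return {k_of_l[l]: y[l] for l in range(M)}       # undo the output permutation

u = {1: 0.5, 2: -1.0, 3: 2.0, 4: 0.25, 5: 3.0, 6: -2.0}  # arbitrary test data
print(index_maps(13, 2))
print(max(abs(direct_U(u, 13)[k] - conv_U(u, 13, 2)[k]) for k in range(1, 7)))
```

The same helpers could be tried with other primes $N$ and a suitable primitive root $\eta$, although only the 13-point case is worked out in this Section.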
The 6-point cyclic convolution of (15) may then be computed according to (13a) and (13b) from the pair of 3-point cyclic convolutions of (16):

\[
\begin{bmatrix} d(0,i) \\ d(1,i) \\ d(2,i) \end{bmatrix}
=
\begin{bmatrix}
b(0,i) & b(2,i) & b(1,i) \\
b(1,i) & b(0,i) & b(2,i) \\
b(2,i) & b(1,i) & b(0,i)
\end{bmatrix}
\begin{bmatrix} a(0,i) \\ a(1,i) \\ a(2,i) \end{bmatrix}
\tag{16}
\]

for $i = 0$ and 1, where $a(n_1, i)$ and $b(n_1, i)$, for $0 \le n_1 \le 2$ and $i = 0$ and 1, are computed according to (13e), (13f) and (13g), (13h), respectively. A 13-point DST of (4a) and (4b) may thus be obtained from two pairs of 3-point circular convolutions as given by (16).

IV. PROPOSED SYSTOLIC ARRAY ARCHITECTURE

A simple, regular and locally connected linear systolic structure is derived here for the computation of a 6-point cyclic convolution according to (13) by concurrent implementation of the pair of 3-point cyclic convolutions of (16). The proposed convolution structure can be used further for the computation of the 13-point DST according to (3) and (4). The dependence graphs (DGs) for computation of the pair of 3-point convolutions of (16) are shown in Figs. 1(a) and 1(b), respectively. The function of each node is depicted in Fig. 1(c). The fixed multiplying coefficients $\{a(n_1, 0)\}$ and $\{a(n_1, 1)\}$ for $0 \le n_1 \le 2$ are pre-computed according to (13e) and (13f). $\{b(n_1, 0)\}$ and $\{b(n_1, 1)\}$ are computed by an input adder according to (13g) and (13h), respectively, and made available to the nodes of the DG. Since both of these DGs have identical functions, they can be merged together and projected along the $j$ direction with a schedule along $[1\ 1]^T$ to obtain a linear systolic array consisting of 3 PEs in the $i$ direction. According to the requirement of systolic transformation, the multiplying coefficients available to the nodes of the DG from direction $[0\ 1]^T$ stay in the PEs, the input values available from direction $[1\ 1]^T$ are transferred to the next PE with 2 delays, and the partial result available to each of the nodes from the $[1\ 0]^T$ direction is moved to the next PE in the subsequent cycle.
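As a numerical check of the reduction that this array implements, the following Python sketch computes an $M$-point cyclic convolution ($M = 2K$, $K$ odd) from two $K$-point cyclic convolutions according to (9)-(13) and compares it with the direct form (8). It is a sketch under stated assumptions: the function names `crt_index`, `cyclic_conv` and `split_conv` are hypothetical, and the choice $s_1 = K + 1$, $s_2 = K$ is one pair of constants that satisfies the conditions given with (10) (it yields $s_1 = 4$, $s_2 = 3$ for $K = 3$).

```python
def crt_index(n1, n2, K):
    """Inverse CRT map (10): n = (n1*s1 + n2*s2) mod 2K, with s1 = K+1 and s2 = K."""
    return (n1 * (K + 1) + n2 * K) % (2 * K)

def cyclic_conv(h, r):
    """Direct cyclic convolution y(k) = sum_n h(n) r((k - n) mod M), as in (8)."""
    M = len(h)
    return [sum(h[n] * r[(k - n) % M] for n in range(M)) for k in range(M)]

def split_conv(h, r):
    """Same result through the pair of K-point convolutions (13c) and (13d)."""
    M = len(h)
    K = M // 2
    if M != 2 * K or K % 2 != 1:
        raise ValueError("length must be 2K with K odd")
    # (13e)-(13h): half-sums/half-differences of h, sums/differences of r
    a0 = [(h[crt_index(n1, 0, K)] + h[crt_index(n1, 1, K)]) / 2 for n1 in range(K)]
    a1 = [(h[crt_index(n1, 0, K)] - h[crt_index(n1, 1, K)]) / 2 for n1 in range(K)]
    b0 = [r[crt_index(n1, 0, K)] + r[crt_index(n1, 1, K)] for n1 in range(K)]
    b1 = [r[crt_index(n1, 0, K)] - r[crt_index(n1, 1, K)] for n1 in range(K)]
    d0 = cyclic_conv(a0, b0)  # (13c)
    d1 = cyclic_conv(a1, b1)  # (13d)
    y = [0.0] * M
    for k1 in range(K):
        y[crt_index(k1, 0, K)] = d0[k1] + d1[k1]  # (13a)
        y[crt_index(k1, 1, K)] = d0[k1] - d1[k1]  # (13b)
    return y

h = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]      # arbitrary test coefficients
r = [0.5, -1.0, 2.0, 0.0, 3.0, -2.0]    # arbitrary test input
print(cyclic_conv(h, r))
print(split_conv(h, r))
```

The two 3-point convolutions use $2 \times 9$ products in place of the 36 products of the direct 6-point form, which is the saving the PEs exploit.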
The proposed linear systolic array to realize a pair of 3-point cyclic convolutions is shown in Fig. 2. The input values are fed to the individual PEs through a circularly-extended input interface, such that the input values to a PE are staggered by one cycle period with respect to the preceding PE to maintain the data dependency requirement. The function of each PE of the structure is shown in Fig. 2(b). Each of the PEs performs a pair of multiplications and a pair of additions in each cycle period, where the multiplications in a PE are always performed with a pair of fixed coefficients. This feature of the PEs can be utilized to implement the multiplications in each PE by a pair of look-up-table (LUT) ROMs that store the product values for all possible input values for the given pair of multiplying coefficients of the PE.

[Fig. 1. The DGs for computation of $\{d(n_1, 0)\}$ and $\{d(n_1, 1)\}$ for $0 \le n_1 \le 2$, given by the circular convolution form of (13). (a) The DG for $\{d(n_1, 0)\}$. (b) The DG for $\{d(n_1, 1)\}$. (c) Function of each node of the DGs: $X_{out} = X_{in} + Z_{in} \cdot Y_{in}$; $Y_{out} = Y_{in}$; $Z_{out} = Z_{in}$.]

The structure of the proposed memory-based PEs is shown in Fig. 3. It consists of two dual-port ROMs, each of size $2^{L/2}$ words, where $L$ is the word length. Each dual-port ROM serves as a look-up table for all the possible values of the product. The bits of each of the input words $U_{in}$ and $V_{in}$ are separated into two equal halves of $L/2$ bits each, and the two halves of each of the two input words are fed in parallel to the pair of address ports of a dual-port ROM as shown in Fig. 3. The more significant halves of the output of the ROM are left-shifted by $(L/2)$ bit positions and added to their other halves by a pair of shift-add cells to generate the desired product values. The pair of outputs of the shift-add cells are added to $X1_{in}$ and $X2_{in}$ to generate the pair of outputs of the PE. As shown in Fig. 3, the function of the proposed memory-based PEs can be implemented in three pipelined stages. The duration of a cycle period of the memory-based implementation can be $T = \max(T_{Mem}, T_{AS})$, where $T_{Mem}$ and $T_{AS}$ are, respectively, the time required for each memory-read operation and the time required for a shift-add operation in the PEs. The actual duration of the cycle period of the memory-based PE will, however, depend on the word length $L$ and on how the adders and memory elements are implemented in the PEs.
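The split-operand look-up described above can be modelled in a few lines of Python. This is only a behavioural sketch under assumptions that the text does not state: inputs are treated as unsigned $L$-bit integers, the fixed coefficient is an integer (for example a scaled fixed-point value), and the names `build_rom`, `lut_multiply` and `pe_step` are hypothetical. Pipelining and timing are not modelled.

```python
def build_rom(coeff, L):
    """ROM of 2**(L/2) words holding coeff * v for every (L/2)-bit value v."""
    half = L // 2
    return [coeff * v for v in range(1 << half)]

def lut_multiply(rom, u, L):
    """Product coeff * u from two ROM reads (both halves of u) and one shift-add."""
    half = L // 2
    lo = u & ((1 << half) - 1)   # less significant half, first address port
    hi = u >> half               # more significant half, second address port
    return (rom[hi] << half) + rom[lo]

def pe_step(rom1, rom2, x1_in, x2_in, u_in, v_in, L):
    """One PE operation: X1out = X1in + C1*Uin and X2out = X2in + C2*Vin."""
    x1_out = x1_in + lut_multiply(rom1, u_in, L)
    x2_out = x2_in + lut_multiply(rom2, v_in, L)
    return x1_out, x2_out

L = 8                        # assumed word length for the illustration
rom_c1 = build_rom(37, L)    # assumed integer coefficients C1 and C2
rom_c2 = build_rom(-5, L)
print(pe_step(rom_c1, rom_c2, 100, 200, 201, 17, L))
```

Splitting the operand keeps each ROM at $2^{L/2}$ words instead of the $2^{L}$ words a full-width product table would need, at the cost of one shift-add per product.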
In the multiply-accumulate implementation, if each PE is designed to have two multipliers and two adders, then the duration of the cycle period would be $T = (T_M + T_A)$, where $T_M$ and $T_A$ are, respectively, the times involved in performing one multiplication and one addition in the PE. The right-most PE of the structure yields its first output 3 cycles after the first input arrives at its left-most PE, and produces its subsequent output in every cycle period thereafter. It delivers one pair of convolved output sequences in every 3 cycle periods once the pipeline is filled in the first 3 cycles. For implementation of a pair of $[(N-1)/2]$-point cyclic convolutions, in general, one can have two such linear arrays, each consisting of $[(N-1)/4]$ PEs. The pair of $[(N-1)/2]$-point cyclic convolutions associated with the computation of an $N$-point DST may, however, be time-multiplexed into the same structure to be computed one after the other.

[Fig. 2. The proposed linear systolic array for computing a pair of 3-point cyclic convolutions. (a) The linear array. (b) Function of each PE: $X1_{out} = X1_{in} + C1 \cdot U_{in}$; $X2_{out} = X2_{in} + C2 \cdot V_{in}$. $U_{in} = b(i-1, 0)$, $V_{in} = b(i-1, 1)$, $1 \le i \le 3$, for the $i$-th PE. $C1$ and $C2$ are constants for a given PE. $\Delta$ stands for unit delay.]

[Fig. 3. Proposed structure of the memory-based processing elements. Stage 1: two dual-port ROMs of size $2^{L/2}$. Stage 2: two shift-add cells. Stage 3: two adders.]

V. HARDWARE AND TIME COMPLEXITIES

In Section II, it was shown that an $N$-point DST can be computed via two $[(N-1)/2]$-point cyclic convolutions. Moreover, when $[(N-1)/4]$ is odd, each of the $[(N-1)/2]$-point cyclic convolutions can be computed by a pair of $[(N-1)/4]$-point cyclic convolutions using a linear systolic array of $[(N-1)/4]$ PEs. The structure would yield its first pair of convolution outputs after a latency of $[2 + (N-1)/4]$ cycles. It would provide a pair of outputs in every cycle after the latency period. A complete set of convolved outputs can be computed in every $[(N-1)/2]$ cycles, where the duration of a cycle period is $T = (T_M + T_A)$ for the multiplier-based implementation and $T = \max(T_{Mem}, T_{AS})$ for the memory-based implementation.

The hardware and time complexities of the proposed systolic realization, along with those of the existing systolic structures for DST and DCT of [7]-[9] and [14], are listed in Table III. It is found that the proposed structure involves the same average computation time (ACT) as that of [9], but needs nearly half the number of multipliers and adders used in the latter. Although the proposed structure involves nearly the same number of multipliers and adders as that of [7], it has half the ACT of the latter. The structure of [14] involves nearly twice the ACT and double the number of adders of the proposed structure. Even though the structure of [8] has nearly the same number of multipliers as the proposed one, it has 3 times the number of adders. Furthermore, the cycle period of the proposed structure is only $T_M + T_A$ compared with $T_M + 3T_A$ for that of [8], even though the ACT involves the same number of computational cycles for both. Also, the proposed structure requires the same number of I/O channels as that of [8], which is comparable to that of the other structures as well.

TABLE III
HARDWARE AND TIME COMPLEXITIES OF THE PROPOSED STRUCTURE AND THE EXISTING MULTIPLIER-BASED SYSTOLIC STRUCTURES

Structures          | Multipliers | Adders     | Registers  | Cycle time (T) | Latency     | ACT        | I/O channels
Chiper et al. [9]   | (N-1)       | (N+1)      | 5(N-1)/2   | T_M + T_A      | 2(N-1)T     | (N-1)T/2   | 3L + N
Chiper [7]          | (N-1)/2     | (N-1)/2    | 5(N-1)/2   | T_M + T_A      | (3N-2)T     | (N-1)T     | 3L + 1
Fang and Wu [14]    | (N/2 + 3)   | (N+3)      | (11N + 4)  | T_M + 2T_A     | 7NT/2       | NT         | 3L + 1
Cheng and Parhi [8] | (N-1)/2     | 3(N-1)/2   | 3(N-3) 4   | T_M + 3T_A     | (5N-1)T/4   | (N-1)T/2   | 4L + 1
Proposed            | (N+3)/2     | (N+11)/2   | 2(N-1)     | T_M + T_A      | 2(N-1)T     | (N-1)T/2   | 4L + 1

The hardware and time complexities of the memory-based realization of the proposed systolic structure are listed in Table IV, along with those of the recently proposed memory-based structure for the DST [10]. The proposed structure requires the same number of cycles of ACT as that of [10], but it involves half the ROM size, nearly half the number of adders, a lower cycle period and fewer I/O channels than the latter.

TABLE IV
HARDWARE AND TIME COMPLEXITIES OF THE PROPOSED STRUCTURE AND THE EXISTING MEMORY-BASED SYSTOLIC STRUCTURE

Structures         | Multipliers | Adders | ROM (words)             | Cycle time (T)      | Latency  | ACT       | I/O channels
Chiper et al. [10] | 2           | 2N + 3 | [(N-1)/2] 2^{(L/2 + 1)} | T_Mem + T_A         | 2(N-1)T  | (N-1)T/2  | 7L + 1
Proposed           | 2           | N + 5  | [(N-1)/4] 2^{(L/2 + 1)} | max(T_Mem, T_AS)    | 2(N-1)T  | (N-1)T/2  | 4L + 1

Unlike that of [10], the proposed convolutional formulation provides a much simpler input regeneration by successive accumulation according to (3d) and (3e). Besides, no additional computation is needed in the proposed formulation to obtain the last DST component $X(N)$, since it is the same as the regenerated input value $x(0)$, as given by (3a). It may also be noted that, unlike the existing structures of [8] and [9], the proposed structure does not involve any tag-bit control for sign alterations for the realization of convolution-like operations.

VI. CONCLUSIONS

A new convolutional formulation is derived to compute the prime-length DST of size $N$ from a pair of $[(N-1)/2]$-point cyclic convolutions. Moreover, a reduced-complexity recursive algorithm is proposed for systolization of each $[(N-1)/2]$-point cyclic convolution through a pair of $[(N-1)/4]$-point cyclic convolutions, when $[(N-1)/4]$ is an odd number.
A simple, regular and locally connected linear array is presented for concurrent pipelined systolic implementation of those cyclic convolutions. It offers a significantly lower area-time complexity than the existing structures. It is interesting to note that the proposed convolutional formulation eliminates the use of control tag-bits, which are otherwise involved in most of the existing structures. The proposed scheme is found to be suitable for efficient memory-based implementation of the DST using ROM-based look-up tables. It would also be useful for DST implementations based on distributed arithmetic and constant multiplication schemes such as canonical signed digit multiplication and CORDIC multiplication. The proposed convolutional formulation and the systolic structure for the prime-length DST can be directly used for implementation of the DCT as well. In this paper, we have used a new and simple input regeneration scheme, which interestingly yields the last DST component as the first regenerated input value.

REFERENCES

[1] Z. Wang, G. A. Jullien, and W. C. Miller, "Interpolation using the discrete sine transform with increased accuracy," Electronics Letters, vol. 29, no. 22, pp. 1918-1920, Oct. 1993.
[2] S. A. Martucci and R. Mersereau, "New approaches to block filtering of images using symmetric convolution and the DST or DCT," in Proc. IEEE International Symp. Circuits and Systems (ISCAS 93), May 1993, pp. 259-262.
[3] F. Beaufays, "Transform-domain adaptive filters: an analytical approach," IEEE Trans. Signal Processing, vol. SP-43, no. 2, pp. 422-431, Feb. 1995.
[4] Y. S. Park and H. W. Park, "Arbitrary-ratio image resizing using fast DCT of composite length for DCT-based transcoder," IEEE Trans. Image Processing, vol. 15, no. 2, pp. 494-500, Feb. 2006.
[5] A. Tatsaki, C. Dre, T. Stouraitis, and C. Goutis, "Prime-factor DCT algorithms," IEEE Trans. Signal Processing, vol. SP-43, no. 3, pp. 772-776, Mar. 1995.
[6] D. Kar and V. V. B. Rao, "On the prime factor decomposition algorithm for the discrete sine transform," IEEE Trans. Signal Processing, vol. SP-42, no. 11, pp. 3258-3260, Nov. 1994.
[7] D. F. Chiper, "A new systolic array algorithm for memory-based VLSI array implementation of DCT," in Proc. Second IEEE Symp. on Computers and Communications, July 1997, pp. 297-301.
[8] C. Cheng and K. K. Parhi, "A novel systolic array structure for DCT," IEEE Trans. Circuits Syst.-II: Express Briefs, vol. 52, no. 7, pp. 366-369, July 2005.
[9] D. F. Chiper, M. N. S. Swamy, M. O. Ahmad, and T. Stouraitis, "A systolic array architecture for the discrete sine transform," IEEE Trans. Signal Process., vol. 50, no. 9, pp. 2347-2354, Sept. 2002.
[10] D. F. Chiper, M. N. S. Swamy, M. O. Ahmad, and T. Stouraitis, "Systolic algorithms and a memory-based design approach for a unified architecture for the computation of DCT/DST/IDCT/IDST," IEEE Trans. Circuits Syst.-I: Regular Papers, vol. 52, no. 6, pp. 1125-1137, June 2005.
[11] P. K. Meher, "Systolic designs for DCT using a low-complexity concurrent convolutional formulation," IEEE Trans. Circuits & Systems for Video Technology, vol. 16, no. 9, pp. 1041-1050, Sept. 2006.
[12] R. C. Agarwal and J. W. Cooley, "New algorithms for digital convolution," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-25, no. 5, pp. 392-410, Oct. 1977.
[13] J.-I. Guo, C.-M. Liu, and C.-W. Jen, "A new array architecture for prime-length discrete cosine transform," IEEE Trans. Signal Processing, vol. 41, no. 1, pp. 436-442, Jan. 1993.
[14] W. H. Fang and M. L. Wu, "Unified fully-pipelined implementations of one- and two-dimensional real discrete trigonometric transforms," IEICE Trans. Fund. Electron., Commun. Comput. Sci., vol. E82-A, no. 10, pp. 2219-2230, Oct. 1999.