

Efficient Parallel Algorithms for Template Matching

Sanguthevar Rajasekaran
Department of CISE, University of Florida

Abstract. The parallel complexity of template matching has been well studied. In this paper we present algorithms that are more work-efficient than the existing ones. Our algorithms are based on FFT primitives. We consider the following models of computing: the PRAM and the hypercube.

1 Introduction

The problem of template matching (also known as convolution) is a basic operation in computer vision and image processing and hence is of vital importance. For instance, template matching is used for edge and object detection, filtering, and image registration [8]. The parallel complexity of this problem has been well studied.

In the 2-D version of this problem, we are given an image $I[0:n-1,\ 0:n-1]$ and a template $T[0:m-1,\ 0:m-1]$ of pixel values. The problem is to compute $C[0:n-1,\ 0:n-1]$, where

$$C[i,j] = \sum_{q=0}^{m-1} \sum_{r=0}^{m-1} I[(i+q) \bmod n,\ (j+r) \bmod n] \cdot T[q,r].$$

Fang and Ni [1] showed that 2-D template matching can be done in $O(m^2)$ time using $n^2$ hypercube processors. They also extended this algorithm to the mesh and shuffle-exchange networks to obtain similar time and processor bounds. Following this work, Prasanna Kumar and Krishnan [5] presented hypercube algorithms for 2-D template matching. They considered the cases when the individual processor memory was $O(m)$ and $O(1)$, respectively. Their algorithm runs in time $O(m^2/k^2 + \log n)$ using $n^2 k^2$ hypercube processors, for any $1 \le k \le m$. Ranka and Sahni [6, 7] have presented 2-D template matching algorithms for both the SIMD and MIMD versions of the hypercube. They have also considered the cases when the individual processor memory is $O(m)$ and $O(1)$, respectively. For example, if each processor has $O(m)$ memory, then their algorithm on the MIMD model runs in time $O(m^2 + \log(n/m))$ using $n^2$ processors.

All of the above algorithms are such that the total work done by them is $\Omega(n^2 m^2)$. (The work done by a parallel algorithm that uses $p$ processors and takes $t$ time is defined to be $pt$.) In this paper we present algorithms whose total work is either $O(n^2 m \log m)$ or $O(n^2 m \log n)$. If $m > \log n$, our algorithms will perform better than previous algorithms. We apply FFT techniques.

2 Models of Computing

In a Parallel Random Access Machine (PRAM), a number of processors work synchronously. They communicate with each other using a common block of global memory that is accessible by all. Communication is performed by writing into and/or reading from the common memory. The EREW (Exclusive Read and Exclusive Write) PRAM is a version in which no concurrent read or write is allowed on any cell of the global memory. A CREW (Concurrent Read and Exclusive Write) PRAM permits concurrent reads but not concurrent writes. The CRCW PRAM model allows both concurrent reads and concurrent writes.

A hypercube of dimension $d$ has $2^d$ processors. Each processor can be labeled with a $d$-bit binary number. Let $v^{(i)}$ stand for the binary number which differs from $v$ only in the $i$th bit. We shall use the same symbol to denote a processor and its label. Any processor $v$ in the hypercube is connected only to the processors $v^{(i)}$ for $i = 1, 2, \ldots, d$.

There are two variants of the hypercube. In the first version, known as the sequential hypercube, single-port hypercube, or SIMD hypercube, it is assumed that in one unit of time a processor can communicate with only one of its neighbors. In contrast, the second version, known as the parallel hypercube, multiport hypercube, or MIMD hypercube, assumes that in one unit of time a processor can communicate with all its $d$ neighbors. We assume the sequential hypercube model wherein each processor has only $O(1)$ local memory.

3 1-D Template Matching

In the one-dimensional case, the problem is stated as follows: $I[0:n-1]$ is a given image and $T[0:m-1]$ is a template. Given $I$ and $T$, compute $C[0:n-1]$, where

$$C[i] = \sum_{j=0}^{m-1} I[(i+j) \bmod n] \cdot T[j], \quad \text{for } 0 \le i \le n-1.$$
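As a point of reference, the 1-D definition can be evaluated directly in $O(nm)$ time. The following is only a minimal sketch; the function name `template_match_1d` is illustrative and does not appear in the paper.

```python
def template_match_1d(image, template):
    """Directly evaluate C[i] = sum_j I[(i + j) mod n] * T[j] in O(n*m) time.

    `image` and `template` are plain Python lists; the name is hypothetical.
    """
    n, m = len(image), len(template)
    return [
        sum(image[(i + j) % n] * template[j] for j in range(m))
        for i in range(n)
    ]


# The last entry wraps around: C[3] = I[3]*T[0] + I[0]*T[1] = 4 + 10.
print(template_match_1d([1, 2, 3, 4], [1, 10]))  # [21, 32, 43, 14]
```

Such a direct evaluation is useful mainly as a baseline against which faster methods can be checked.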

First consider the case when $n = m$. This computation can be reduced to polynomial multiplication and hence to FFT computation. Let

$$A(x) = I[0] + I[1]x + I[2]x^2 + \cdots + I[m-1]x^{m-1}$$

and

$$B(x) = T[m-1] + T[m-2]x + T[m-3]x^2 + \cdots + T[0]x^{m-1}.$$

If $D(x) = A(x) \cdot B(x) = d_0 + d_1 x + \cdots + d_{2m-2} x^{2m-2}$, then $C[0] = d_{m-1}$ and $C[i] = d_{i+m-1} + d_{i-1}$, for $i = 1, 2, \ldots, m-1$. Thus, if $n = m$, the $C[i]$'s can be computed in $O(m \log m)$ time sequentially using the fast Fourier transform (FFT) algorithms (see e.g., [2]).

Now we consider the case when $n > m$. Obtaining $C[\,]$ can now be reduced to $\lceil n/m \rceil$ steps, where in each step two polynomials of degree $\le m-1$ are multiplied. Assume w.l.o.g. that $n = km$ for some integer $k$. Let

$$A_i(x) = I[im] + I[im+1]x + I[im+2]x^2 + \cdots + I[im+m-1]x^{m-1}, \quad \text{for } i = 0, 1, \ldots, k-1,$$

and let $B(x) = T[m-1] + T[m-2]x + T[m-3]x^2 + \cdots + T[0]x^{m-1}$ as before. Multiply $A_i(x)$ with $B(x)$, for $0 \le i \le k-1$. This will take a total of $O(n \log m)$ time. Let

$$D_i(x) = A_i(x) \cdot B(x) = d^i_0 + d^i_1 x + \cdots + d^i_{2m-2} x^{2m-2}, \quad \text{for } 0 \le i \le k-1.$$

Then $C[im] = d^i_{m-1}$ for $0 \le i \le k-1$. Also,

$$C[im+j] = d^i_{m-1+j} + d^{(i+1) \bmod k}_{j-1}, \quad \text{for } 0 \le i \le k-1 \text{ and } 1 \le j \le m-1.$$

Thus, $C[\,]$ can be computed in $O(n)$ time, given the coefficients of all the $D_i(x)$'s.

3.1 1-D Template Matching on the PRAM

The following lemma pertains to computing FFTs on the PRAM [3].

Lemma 3.1 The FFT of a vector of length $m$ (and hence the product of two polynomials of degree $m-1$ each) can be computed in $O(\log m)$ time using $m$ EREW PRAM processors.

As a corollary to Lemma 3.1, we get:

Theorem 3.1 The 1-D template matching problem can be solved in $O(\log m)$ time using $n$ EREW PRAM processors.

Since the work done by the above algorithm is $O(n \log m)$, it is work-optimal.

3.2 1-D Template Matching on the Hypercube

The following lemma will prove helpful in our presentation [4]:

Lemma 3.2 The following problems can be solved on an $n$-node hypercube in $O(\log n)$ time: 1) monotone routing (see [4] or [2] for a definition), when the data is of length $n$; 2) broadcasting a data item; 3) computing the FFT of a vector of length $n$; and 4) computing the product of two polynomials of degree $n$ each.

Let $H_q$ denote a hypercube of dimension $q$. Assume that $m = 2^\ell$ and $n = 2^r$ for some integers $\ell$ and $r$. Consider a hypercube $H_r$. Let $I[\,]$ be stored one element per node of $H_r$. Also let $T[\,]$ be input one element per node of a subcube $H_\ell$ of $H_r$.

If we fix some $q$ bits and vary the remaining bits of an $r$-bit number, the corresponding processors form a subcube $H_{r-q}$ in $H_r$. Thus $H_r$ consists of $2^q$ copies of the subcube $H_{r-q}$. Call the subcube $H_\ell$ whose first $r-\ell$ bits have the value $i$ row $i$, $0 \le i \le 2^{r-\ell}-1$. Note also that all the nodes whose last $\ell$ bits are the same form a subcube $H_{r-\ell}$. Call the subcube whose last $\ell$ bits have the value $j$ column $j$, $0 \le j \le 2^\ell - 1$.

Let $T[\,]$ be input in row 0. Broadcast $T[\,]$ so that every row has a copy of $T[\,]$. This can be done by broadcasting $T[j]$ in column $j$, for $0 \le j \le 2^\ell - 1$. Broadcasting takes $O(r-\ell) = O(\log n)$ time.

Now row $i$ has $A_i(x)$ and $B(x)$. Let row $i$ multiply $A_i(x)$ and $B(x)$ to obtain $D_i(x)$ using Lemma 3.2. The time needed is $O(\ell) = O(\log n)$. Let the result be stored two coefficients per node in row $i$. In particular, $D_i[0]$ and $D_i[m]$ will be in node 0 of row $i$, $D_i[1]$ and $D_i[m+1]$ will be in node 1 of row $i$, and so on.

$C[\,]$ can now be obtained by an appropriate shift of $D[\,]$. The shift itself corresponds to a monotone routing step and hence can be completed in $O(\log n)$ time. As a result we arrive at:

Theorem 3.2 The 1-D template matching problem can be solved on an $n$-node hypercube in $O(\log n)$ time. Here $n$ is the size of $I[\,]$ and $m$ is the size of $T[\,]$.

4 2-D Template Matching

In the 2-D version of template matching, we are given an image $I[0:n-1,\ 0:n-1]$ and a template $T[0:m-1,\ 0:m-1]$ of pixel values. The problem is to compute $C[0:n-1,\ 0:n-1]$, where

$$C[i,j] = \sum_{q=0}^{m-1} \sum_{r=0}^{m-1} I[(i+q) \bmod n,\ (j+r) \bmod n] \cdot T[q,r].$$

Here also, we can employ FFT techniques to obtain efficient parallel algorithms.
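The block-wise 1-D reduction of Section 3, which serves as the 1-D subroutine in what follows, can be checked sequentially with a short sketch. It assumes integer pixel values (so that rounding the FFT output is exact for small inputs) and $n = km$; the names `fft`, `poly_mul`, and `template_match_1d_blocks` are illustrative and not taken from the paper, and the simple recursive FFT stands in for whichever FFT routine one prefers.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT (unscaled); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def poly_mul(a, b):
    """Product of two integer polynomials (coefficient lists, low degree first)."""
    size = 1
    while size < len(a) + len(b) - 1:
        size *= 2
    fa = fft([complex(x) for x in a] + [0j] * (size - len(a)))
    fb = fft([complex(x) for x in b] + [0j] * (size - len(b)))
    prod = fft([x * y for x, y in zip(fa, fb)], invert=True)
    # The inverse transform above is unscaled, so divide by size before rounding.
    return [round(v.real / size) for v in prod[: len(a) + len(b) - 1]]


def template_match_1d_blocks(image, template):
    """1-D circular template matching via k polynomial products (assumes n = k*m)."""
    n, m = len(image), len(template)
    assert n % m == 0, "sketch assumes n is a multiple of m"
    k = n // m
    b = template[::-1]  # B(x) = T[m-1] + T[m-2] x + ... + T[0] x^(m-1)
    d = [poly_mul(image[i * m:(i + 1) * m], b) for i in range(k)]  # D_i = A_i * B
    c = [0] * n
    for i in range(k):
        c[i * m] = d[i][m - 1]
        for j in range(1, m):
            # C[im + j] = d^i_{m-1+j} + d^{(i+1) mod k}_{j-1}
            c[i * m + j] = d[i][m - 1 + j] + d[(i + 1) % k][j - 1]
    return c


print(template_match_1d_blocks([1, 2, 3, 4], [1, 10]))  # [21, 32, 43, 14]
```

With $k = 1$ the loop reduces to the $n = m$ case, $C[i] = d_{i+m-1} + d_{i-1}$. In the parallel algorithms, the $k$ products are the part that is carried out simultaneously.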

The idea is to perform a series of 1-D template matchings. In particular, compute the convolution of the $q$th row of $I[\,]$ with the $r$th row of $T[\,]$ for $0 \le q \le n-1$, $0 \le r \le m-1$. Let $Q^{(q,r)}[0:n-1]$ be the result. Sequentially, this computation will take time $O(n^2 m \log m)$, since there are $nm$ row pairs and each pair takes $O(n \log m)$ time. Now compute $C[i,j]$ as

$$C[i,j] = \sum_{\ell=0}^{m-1} Q^{((i+\ell) \bmod n,\ \ell)}[j].$$

Given the $Q^{(q,r)}$'s, it takes only an additional $O(n^2 m)$ time to compute all the $C[i,j]$'s. Thus the total sequential run time is $O(n^2 m \log m)$.

4.1 2-D Template Matching on the PRAM

We assume the EREW PRAM model. For a given $q$ and $r$, $Q^{(q,r)}[0:n-1]$ can be computed in $O(\log m)$ time using $n$ processors (cf. Theorem 3.1). Thus, if we have $n^2 m$ processors, all the $Q^{(q,r)}[0:n-1]$'s can be computed in $O(\log m)$ time. After this step, assign $m$ processors to compute each of the $C[i,j]$'s. The $m$ processors corresponding to any $C[i,j]$ add up the $m$ relevant items using the prefix algorithm in $O(\log m)$ time. As a result, we have:

Theorem 4.1 The 2-D template matching problem can be solved in $O(\log m)$ time using $n^2 m$ EREW PRAM processors. Here $n \times n$ is the image size and $m \times m$ is the template size.

4.2 2-D Template Matching on the Hypercube

Assume that $n = 2^r$ for some integer $r$. Also assume that $m = 2^\ell$ for some integer $\ell$. We employ a hypercube with $n^2 m = 2^{2r+\ell}$ nodes. Let $(i, j, *)$ stand for the subcube whose first $r$ bits have the value $i$, whose next $r$ bits equal $j$, and whose last $\ell$ bits vary ($0 \le i, j \le 2^r - 1$). Similarly define $(i, *, k)$ and $(*, j, k)$. Let $(i, *, *)$ stand for the subcube whose first $r$ bits have the value $i$ and whose other bits vary ($0 \le i \le 2^r - 1$). Similarly define $(*, j, *)$ and $(*, *, k)$.

The image $I[\,]$ is input in the subcube $(*, *, 0)$. The template is also input in the same subcube. Broadcast $I[\,]$ and $T[\,]$ so that each subcube $(*, *, k)$ has a copy of $I[\,]$ and $T[\,]$, $0 \le k \le 2^\ell - 1$. This can be done in time $O(r)$.

Subcube $(*, *, k)$ has $n^2$ processors. These processors compute $Q^{(u,k)}[0:n-1]$ for $u = 0, 1, \ldots, n-1$. This is done in parallel for each $k$. The time taken is $O(r)$.
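As a sequential sanity check of the row-wise decomposition used by both 2-D algorithms, the sketch below builds the $Q^{(q,r)}$ arrays and then sums them as in the formula for $C[i,j]$ given at the start of this section. For brevity the 1-D step is evaluated directly rather than by FFT, so the sketch illustrates the decomposition only, not the stated time bound; the names `match_rows` and `template_match_2d` are illustrative.

```python
def match_rows(image_row, template_row):
    """Q[j] = sum_r image_row[(j + r) mod n] * template_row[r] (direct evaluation)."""
    n = len(image_row)
    return [
        sum(image_row[(j + r) % n] * t for r, t in enumerate(template_row))
        for j in range(n)
    ]


def template_match_2d(image, template):
    """2-D circular template matching as a series of 1-D row matchings."""
    n, m = len(image), len(template)
    # q_table[q][r] plays the role of Q^(q,r): image row q against template row r.
    q_table = [
        [match_rows(image[q], template[r]) for r in range(m)]
        for q in range(n)
    ]
    # C[i][j] = sum over l of Q^((i + l) mod n, l)[j]
    return [
        [sum(q_table[(i + l) % n][l][j] for l in range(m)) for j in range(n)]
        for i in range(n)
    ]


# With T = [[1, 0], [0, 1]], C[i][j] = I[i][j] + I[(i+1) mod 2][(j+1) mod 2].
print(template_match_2d([[1, 2], [3, 4]], [[1, 0], [0, 1]]))  # [[5, 5], [5, 5]]
```

In the hypercube algorithm, the index $l$ of the final sum corresponds to the $k$ coordinate of the subcubes $(*, *, k)$.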

Now the $C[i,j]$'s can be computed by appropriate shifts along the `$j$-axis' in $O(r)$ time and a prefix computation along the `$k$-axis' in $O(\ell)$ time. Thus the total run time is $O(r + \ell) = O(\log n)$. Thus we have shown:

Theorem 4.2 2-D template matching can be solved in $O(\log n)$ time using $n^2 m$ hypercube processors. Here $n \times n$ is the size of the image and $m \times m$ is the size of the template.

In contrast to Theorem 4.2, Prasanna Kumar and Krishnan's [5] algorithm uses $n^2 m^2$ processors and runs in time $O(\log n)$.

References

[1] Z. Fang and L. Ni, Parallel Algorithms for 2-D Convolution, Proc. International Conference on Parallel Processing, 1986.
[2] E. Horowitz, S. Sahni, and S. Rajasekaran, Computer Algorithms/C++, W. H. Freeman Press.
[3] J. JáJá, An Introduction to Parallel Algorithms, Addison-Wesley Publishing Company.
[4] F. T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann Publishers.
[5] V. Prasanna Kumar and V. Krishnan, Efficient Template Matching on SIMD Arrays, Proc. International Conference on Parallel Processing, 1987.
[6] S. Ranka and S. Sahni, Image Template Matching on SIMD Hypercube Multicomputers, Proc. International Conference on Parallel Processing, 1988.
[7] S. Ranka and S. Sahni, Image Template Matching on MIMD Hypercube Multicomputers, Proc. International Conference on Parallel Processing, 1988.
[8] A. Rosenfeld and A. C. Kak, Digital Picture Processing, Academic Press.
