Efficient Parallel Algorithms for Template Matching

Sanguthevar Rajasekaran
Department of CISE, University of Florida

Abstract. The parallel complexity of template matching has been well studied. In this paper we present algorithms that are more work-efficient than the existing ones. Our algorithms are based on FFT primitives. We consider the following models of computing: the PRAM and the hypercube.

1 Introduction

The problem of template matching (also known as convolution) is a basic operation in computer vision and image processing and hence is of vital importance. For instance, template matching is used for edge and object detection, filtering, and image registration [8]. The parallel complexity of this problem has been well studied.

In the 2-D version of this problem, we are given an image $I[0:n-1,\ 0:n-1]$ and a template $T[0:m-1,\ 0:m-1]$ of pixel values. The problem is to compute $C[0:n-1,\ 0:n-1]$, where

$$C[i,j] = \sum_{q=0}^{m-1} \sum_{r=0}^{m-1} I[(i+q) \bmod n,\ (j+r) \bmod n] \cdot T[q,r].$$

Fang and Ni [1] showed that 2-D template matching can be done in $O(m^2)$ time using $n^2$ hypercube processors. They also extended this algorithm to the mesh and shuffle-exchange networks to obtain similar time and processor bounds. Following this work, Prasanna Kumar and Krishnan [5] presented hypercube algorithms for 2-D template matching. They considered the cases when the individual processor memory was $O(m)$ and $O(1)$, respectively. Their algorithm runs in time $O(m^2/k^2 + \log n)$ using $n^2 k^2$ hypercube processors, for any $1 \le k \le m$. Ranka and Sahni [6, 7] have presented 2-D template matching algorithms for both SIMD and MIMD versions of the hypercube. They have also considered the cases when the individual processor memory is $O(m)$ and $O(1)$, respectively. For example, if each processor has $O(m)$ memory, then their algorithm on the MIMD model runs in time $O(m^2 + \log(n/m))$ using $n^2$ processors.

All of the above algorithms are such that the total work done by them is $\Omega(n^2 m^2)$. (The work done by a parallel algorithm that uses $p$ processors and takes $t$ time is defined to be $pt$.) In this paper we present algorithms whose total work is either $O(n^2 m \log m)$ or $O(n^2 m \log n)$. If $m > \log n$, our algorithms will perform better than previous algorithms. We apply FFT techniques.

2 Models of Computing

In a Parallel Random Access Machine (PRAM), a number of processors work synchronously. They communicate with each other using a common block of global memory that is accessible by all. Communication is performed by writing into and/or reading from the common memory. The EREW (Exclusive Read and Exclusive Write) PRAM is a version in which no concurrent read or write is allowed on any cell of the global memory. A CREW (Concurrent Read and Exclusive Write) PRAM permits concurrent reads but not concurrent writes. The CRCW PRAM model allows both concurrent reads and concurrent writes.

A hypercube of dimension $d$ has $2^d$ processors. Each processor can be labeled with a $d$-bit binary number. Let $v^{(i)}$ stand for the binary number which differs from $v$ only in the $i$th bit. We shall use the same symbol to denote a processor and its label. Any processor $v$ in the hypercube is connected only to the processors $v^{(i)}$ for $i = 1, 2, \ldots, d$.

There are two variants of the hypercube. In the first version, known as the sequential hypercube, single-port hypercube, or SIMD hypercube, it is assumed that in one unit of time a processor can communicate with only one of its neighbors. In contrast, the second version, known as the parallel hypercube, multiport hypercube, or MIMD hypercube, assumes that in one unit of time a processor can communicate with all its $d$ neighbors. We assume the sequential hypercube model wherein each processor has only $O(1)$ local memory.

3 1-D Template Matching

In the one-dimensional case, this problem is stated as follows: $I[0:n-1]$ is a given image and $T[0:m-1]$ is a template. Given $I$ and $T$, compute $C[0:n-1]$, where

$$C[i] = \sum_{j=0}^{m-1} I[(i+j) \bmod n] \cdot T[j], \quad \text{for } 0 \le i \le n-1.$$
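To make the 1-D definition concrete, the following is a minimal sketch that evaluates $C[i]$ directly from the sum above in $O(nm)$ time. The function name `match_1d_direct` and the sample data are illustrative assumptions, not part of the paper; the sketch is meant only as a reference against which faster methods can be checked.

```python
def match_1d_direct(image, template):
    """Return C with C[i] = sum over j of image[(i + j) % n] * template[j]."""
    n, m = len(image), len(template)
    return [
        # The index (i + j) % n wraps around the end of the image.
        sum(image[(i + j) % n] * template[j] for j in range(m))
        for i in range(n)
    ]


# Hypothetical sample data: n = 4, m = 2.
print(match_1d_direct([1, 2, 3, 4], [1, 10]))  # [21, 32, 43, 14]
```

The last entry shows the wrap-around: $C[3] = I[3] \cdot T[0] + I[0] \cdot T[1] = 4 + 10 = 14$.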

First consider the case when $n = m$. This computation can be reduced to polynomial multiplication and hence to FFT computation. Let $A(x) = I[0] + I[1]x + I[2]x^2 + \cdots + I[m-1]x^{m-1}$ and $B(x) = T[m-1] + T[m-2]x + T[m-3]x^2 + \cdots + T[0]x^{m-1}$. If $D(x) = A(x) \cdot B(x) = d_0 + d_1 x + \cdots + d_{2m-2} x^{2m-2}$, then $C[0] = d_{m-1}$ and $C[i] = d_{i+m-1} + d_{i-1}$, for $i = 1, 2, \ldots, m-1$. Thus, the $C[i]$'s can be computed in $O(m \log m)$ time sequentially if $n = m$ using the fast Fourier transform (FFT) algorithms (see e.g., [2]).

Now we consider the case when $n > m$. Obtaining $C[\,]$ can now be reduced to $\lceil n/m \rceil$ steps, where in each step two polynomials of degree $\le m-1$ are multiplied. Assume w.l.o.g. that $n = km$ for some integer $k$. Let $A_i(x) = I[im] + I[im+1]x + I[im+2]x^2 + \cdots + I[im+m-1]x^{m-1}$, for $i = 0, 1, \ldots, k-1$. Let $B(x) = T[m-1] + T[m-2]x + T[m-3]x^2 + \cdots + T[0]x^{m-1}$. Multiply $A_i(x)$ with $B(x)$, for $0 \le i \le k-1$. This will take a total of $O(n \log m)$ time. Let $D_i(x) = A_i(x) \cdot B(x) = d^i_0 + d^i_1 x + \cdots + d^i_{2m-2} x^{2m-2}$, for $0 \le i \le k-1$. Then $C[im] = d^i_{m-1}$ for $0 \le i \le k-1$. Also, $C[im+j] = d^i_{m-1+j} + d^{(i+1) \bmod k}_{j-1}$, for $0 \le i \le k-1$ and $1 \le j \le m-1$. Thus, $C[\,]$ can be computed in $O(n)$ time, given the coefficients of all the $D_i(x)$'s.

3.1 1-D Template Matching on the PRAM

The following lemma pertains to computing FFTs on the PRAM [3].

Lemma 3.1 FFT of a vector of length $m$ (and hence the product of two polynomials of degree $m-1$ each) can be computed in $O(\log m)$ time using $m$ EREW PRAM processors.

As a corollary to Lemma 3.1, we get:

Theorem 3.1 The 1-D template matching problem can be solved in $O(\log m)$ time using $n$ EREW PRAM processors.

Since the work done by the above algorithm is $O(n \log m)$, it is work-optimal.

3.2 1-D Template Matching on the Hypercube

The following lemma will prove helpful in our presentation [4]:

Lemma 3.2 The following problems can be solved on an $n$-node hypercube in $O(\log n)$ time: 1) Monotone routing (see [4] or [2] for a definition), when the data is of length $n$; 2) Broadcasting a data item; 3) Computing the FFT of a vector of length $n$; and 4) Computing the product of two polynomials of degree $n$ each.

Let $H_q$ denote a hypercube of dimension $q$. Assume that $m = 2^\ell$ and $n = 2^r$ for some integers $\ell$ and $r$. Consider a hypercube $H_r$. Let $I[\,]$ be stored one element per node of $H_r$. Also let $T[\,]$ be input one element per node of a subcube $H_\ell$ of $H_r$.

If we fix some $q$ bits and vary the remaining bits of an $r$-bit number, the corresponding processors form a subcube $H_{r-q}$ in $H_r$. Thus $H_r$ consists of $2^q$ copies of the subcube $H_{r-q}$. Call the subcube $H_\ell$ whose first $r-\ell$ bits have the value $i$ row $i$, $0 \le i \le 2^{r-\ell}-1$. Note also that all the nodes whose last $\ell$ bits are the same form a subcube $H_{r-\ell}$. Call the subcube whose last $\ell$ bits have the value $j$ column $j$, $0 \le j \le 2^\ell - 1$.

Let $T[\,]$ be input in row 0. Broadcast $T[\,]$ so that every row has a copy of $T[\,]$. This can be done by broadcasting $T[j]$ in column $j$, for $0 \le j \le 2^\ell - 1$. Broadcasting takes $O(r-\ell) = O(\log n)$ time. Now row $i$ has $A_i(x)$ and $B(x)$. Let row $i$ multiply $A_i(x)$ and $B(x)$ to obtain $D_i(x)$ using Lemma 3.2. The time needed is $O(\ell) = O(\log n)$. Let the result be stored two coefficients per node in row $i$. In particular, $D_i[0]$ and $D_i[m]$ will be in node 0 of row $i$, $D_i[1]$ and $D_i[m+1]$ will be in node 1 of row $i$, and so on. $C[\,]$ can now be obtained by an appropriate shift of $D[\,]$. The shift itself corresponds to a monotone routing step and hence can be completed in $O(\log n)$ time. As a result we arrive at:

Theorem 3.2 The 1-D template matching problem can be solved on an $n$-node hypercube in $O(\log n)$ time. Here $n$ is the size of $I[\,]$ and $m$ is the size of $T[\,]$.

4 2-D Template Matching

In the 2-D version of template matching, we are given an image $I[0:n-1,\ 0:n-1]$ and a template $T[0:m-1,\ 0:m-1]$ of pixel values.
The problem is to compute $C[0:n-1,\ 0:n-1]$, where $C[i,j] = \sum_{q=0}^{m-1} \sum_{r=0}^{m-1} I[(i+q) \bmod n,\ (j+r) \bmod n] \cdot T[q,r]$. Here also, we can employ FFT techniques to obtain efficient parallel algorithms.
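The FFT technique in question is the 1-D reduction of Section 3: split $I$ into $k = n/m$ blocks, form the products $D_i(x) = A_i(x) \cdot B(x)$, and combine their coefficients in $O(n)$ time. Since the 2-D algorithms reuse it, a sequential sketch of that reduction may be helpful here. The names `fft`, `poly_multiply`, and `match_1d_fft` are illustrative assumptions, the FFT is a plain recursive radix-2 version rather than any particular parallel implementation, and the rounding step assumes integer pixel values.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    size = len(values)
    if size == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * size
    for k in range(size // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / size) * odd[k]
        out[k] = even[k] + twiddle
        out[k + size // 2] = even[k] - twiddle
    return out


def poly_multiply(a, b):
    """Coefficients of a(x) * b(x) via the FFT (assumes integer inputs)."""
    result_len = len(a) + len(b) - 1
    size = 1
    while size < result_len:
        size *= 2
    fa = fft([complex(v) for v in a] + [0j] * (size - len(a)))
    fb = fft([complex(v) for v in b] + [0j] * (size - len(b)))
    product = fft([x * y for x, y in zip(fa, fb)], invert=True)
    # The inverse above is unnormalised, so divide by size before rounding.
    return [round(product[t].real / size) for t in range(result_len)]


def match_1d_fft(image, template):
    """Blocked 1-D template matching; this sketch assumes n = k * m."""
    n, m = len(image), len(template)
    if n % m != 0:
        raise ValueError("this sketch assumes n is a multiple of m")
    k = n // m
    b = template[::-1]  # B(x) = T[m-1] + T[m-2] x + ... + T[0] x^(m-1)
    # d[i] holds the coefficients of D_i(x) = A_i(x) * B(x).
    d = [poly_multiply(image[i * m:(i + 1) * m], b) for i in range(k)]
    c = [0] * n
    for i in range(k):
        c[i * m] = d[i][m - 1]
        for j in range(1, m):
            # C[im + j] = d^i_{m-1+j} + d^{(i+1) mod k}_{j-1}
            c[i * m + j] = d[i][m - 1 + j] + d[(i + 1) % k][j - 1]
    return c


print(match_1d_fft([1, 2, 3, 4], [1, 10]))  # [21, 32, 43, 14]
```

When $k = 1$ the combination step reduces to $C[j] = d_{m-1+j} + d_{j-1}$, which is the $n = m$ case. Floating-point FFTs lose precision for large values, so an exact transform would be the safer choice outside of a small demonstration like this one.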

The idea is to perform a series of 1-D template matchings. In particular, compute the convolution of the $q$th row of $I[\,]$ with the $r$th row of $T[\,]$ for $0 \le q \le n-1$, $0 \le r \le m-1$. Let $Q^{(q,r)}[0:n-1]$ be the result. Sequentially, this computation will take time $O(n^2 m \log m)$. Now compute $C[i,j]$ as $\sum_{\ell=0}^{m-1} Q^{((i+\ell) \bmod n,\ \ell)}[j]$. Given the $Q^{(q,r)}$'s, it takes only an additional $O(n^2 m)$ time to compute all the $C[i,j]$'s. Thus the total sequential run time is $O(n^2 m \log m)$.

4.1 2-D Template Matching on the PRAM

We assume the EREW PRAM model. For a given $q$ and $r$, $Q^{(q,r)}[0:n-1]$ can be computed in $O(\log m)$ time using $n$ processors (cf. Theorem 3.1). Thus, if we have $n^2 m$ processors, all the $Q^{(q,r)}[0:n-1]$'s can be computed in $O(\log m)$ time. After this step, assign $m$ processors to compute each of the $C[i,j]$'s. The $m$ processors corresponding to any $C[i,j]$ add up the $m$ relevant items using the prefix algorithm in $O(\log m)$ time. As a result, we have:

Theorem 4.1 The 2-D template matching problem can be solved in $O(\log m)$ time using $n^2 m$ EREW PRAM processors. Here $n \times n$ is the image size and $m \times m$ is the template size.

4.2 2-D Template Matching on the Hypercube

Assume that $n = 2^r$ for some integer $r$. Also assume that $m = 2^\ell$ for some integer $\ell$. We employ a hypercube with $n^2 m = 2^{2r+\ell}$ nodes. Let $(i, j, *)$ stand for the subcube whose first $r$ bits have the value $i$, whose next $r$ bits equal $j$, and whose last $\ell$ bits vary ($0 \le i, j \le 2^r - 1$). Similarly define $(i, *, k)$ and $(*, j, k)$. Let $(i, *, *)$ stand for the subcube whose first $r$ bits have the value $i$ and whose other bits vary ($0 \le i \le 2^r - 1$). Similarly define $(*, j, *)$ and $(*, *, k)$.

The image $I[\,]$ is input in the subcube $(*, *, 0)$. The template is also input in the same subcube. Broadcast $I[\,]$ and $T[\,]$ so that each subcube $(*, *, k)$ has a copy of $I[\,]$ and $T[\,]$, $0 \le k \le 2^\ell - 1$. This can be done in time $O(r)$. Subcube $(*, *, k)$ has $n^2$ processors. These processors compute $Q^{(u,k)}[0:n-1]$ for $u = 0, 1, \ldots, n-1$. This is done in parallel for each $k$. The time taken is $O(r)$.
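As a sequential stand-in for the data flow just described, the sketch below computes, for each template row $k$, the $n$ arrays $Q^{(u,k)}$ that subcube $(*, *, k)$ is responsible for, and then forms $C[i,j]$ as the sum over $k$ of $Q^{((i+k) \bmod n,\ k)}[j]$. It models only which values are combined, not the hypercube communication or its running time. The names `match_rows` and `match_2d_by_rows` are illustrative assumptions, and the row matching is done directly in $O(nm)$ time for brevity where the paper would use the FFT-based 1-D routine.

```python
def match_rows(image_row, template_row):
    """1-D circular matching of one image row against one template row."""
    n, m = len(image_row), len(template_row)
    return [
        sum(image_row[(j + c) % n] * template_row[c] for c in range(m))
        for j in range(n)
    ]


def match_2d_by_rows(image, template):
    """2-D matching of an n x n image assembled from n * m row matchings."""
    n, m = len(image), len(template)
    # q[k][u] plays the role of Q^(u,k): image row u against template row k.
    q = [[match_rows(image[u], template[k]) for u in range(n)] for k in range(m)]
    # C[i][j] is the sum over k of Q^((i + k) mod n, k)[j].
    return [
        [sum(q[k][(i + k) % n][j] for k in range(m)) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical sample data: a 2 x 2 image and a 2 x 2 diagonal template.
print(match_2d_by_rows([[1, 2], [3, 4]], [[1, 0], [0, 1]]))  # [[5, 5], [5, 5]]
```

In the sample, each output entry is $I[i,j] + I[(i+1) \bmod 2,\ (j+1) \bmod 2]$, which is 5 at every position.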

Now the $C[i,j]$'s can be computed by appropriate shifts along the 'j-axis' in $O(r)$ time and a prefix computation along the 'k-axis' in $O(\ell)$ time. Thus the total run time is $O(r + \ell) = O(\log n)$. Thus we have shown:

Theorem 4.2 2-D template matching can be solved in $O(\log n)$ time using $n^2 m$ hypercube processors. Here $n \times n$ is the size of the image and $m \times m$ is the size of the template.

In contrast to Theorem 4.2, Prasanna Kumar and Krishnan's [5] algorithm uses $n^2 m^2$ processors and runs in time $O(\log n)$.

References

[1] Z. Fang and L. Ni, Parallel Algorithms for 2-D Convolution, Proc. International Conference on Parallel Processing, 1986.
[2] E. Horowitz, S. Sahni, and S. Rajasekaran, Computer Algorithms/C++, W. H. Freeman Press.
[3] J. JáJá, An Introduction to Parallel Algorithms, Addison-Wesley Publishing Company.
[4] F. T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann Publishers.
[5] V. Prasanna Kumar and V. Krishnan, Efficient Template Matching on SIMD Arrays, Proc. International Conference on Parallel Processing, 1987.
[6] S. Ranka and S. Sahni, Image Template Matching on SIMD Hypercube Multicomputers, Proc. International Conference on Parallel Processing, 1988.
[7] S. Ranka and S. Sahni, Image Template Matching on MIMD Hypercube Multicomputers, Proc. International Conference on Parallel Processing, 1988.
[8] A. Rosenfeld and A. C. Kak, Digital Picture Processing, Academic Press.


More information

Exercises - Solutions

Exercises - Solutions Chapter 1 Exercises - Solutions Exercise 1.1. The first two definitions are equivalent, since we follow in both cases the unique path leading from v to a sink and using only a i -edges leaving x i -nodes.

More information

LCNS, Vol 762, pp , Springer 1993

LCNS, Vol 762, pp , Springer 1993 On the Power of Reading and Writing Simultaneously in Parallel Compations? Rolf Niedermeier and Peter Rossmanith?? Fakultat fur Informatik, Technische Universitat Munchen Arcisstr. 21, 80290 Munchen, Fed.

More information

Optimal Extension Field Inversion in the Frequency Domain

Optimal Extension Field Inversion in the Frequency Domain Optimal Extension Field Inversion in the Frequency Domain Selçuk Baktır, Berk Sunar WPI, Cryptography & Information Security Laboratory, Worcester, MA, USA Abstract. In this paper, we propose an adaptation

More information

The Design Procedure. Output Equation Determination - Derive output equations from the state table

The Design Procedure. Output Equation Determination - Derive output equations from the state table The Design Procedure Specification Formulation - Obtain a state diagram or state table State Assignment - Assign binary codes to the states Flip-Flop Input Equation Determination - Select flipflop types

More information

Finite Fields. SOLUTIONS Network Coding - Prof. Frank H.P. Fitzek

Finite Fields. SOLUTIONS Network Coding - Prof. Frank H.P. Fitzek Finite Fields In practice most finite field applications e.g. cryptography and error correcting codes utilizes a specific type of finite fields, namely the binary extension fields. The following exercises

More information

Fast Polynomial Multiplication

Fast Polynomial Multiplication Fast Polynomial Multiplication Marc Moreno Maza CS 9652, October 4, 2017 Plan Primitive roots of unity The discrete Fourier transform Convolution of polynomials The fast Fourier transform Fast convolution

More information

An Embedding of Multiple Edge-Disjoint Hamiltonian Cycles on Enhanced Pyramid Graphs

An Embedding of Multiple Edge-Disjoint Hamiltonian Cycles on Enhanced Pyramid Graphs Journal of Information Processing Systems, Vol.7, No.1, March 2011 DOI : 10.3745/JIPS.2011.7.1.075 An Embedding of Multiple Edge-Disjoint Hamiltonian Cycles on Enhanced Pyramid Graphs Jung-Hwan Chang*

More information

Analysis and Synthesis of Weighted-Sum Functions

Analysis and Synthesis of Weighted-Sum Functions Analysis and Synthesis of Weighted-Sum Functions Tsutomu Sasao Department of Computer Science and Electronics, Kyushu Institute of Technology, Iizuka 820-8502, Japan April 28, 2005 Abstract A weighted-sum

More information

Boolean circuits. Lecture Definitions

Boolean circuits. Lecture Definitions Lecture 20 Boolean circuits In this lecture we will discuss the Boolean circuit model of computation and its connection to the Turing machine model. Although the Boolean circuit model is fundamentally

More information

Stream Ciphers. Çetin Kaya Koç Winter / 20

Stream Ciphers. Çetin Kaya Koç   Winter / 20 Çetin Kaya Koç http://koclab.cs.ucsb.edu Winter 2016 1 / 20 Linear Congruential Generators A linear congruential generator produces a sequence of integers x i for i = 1,2,... starting with the given initial

More information

Constant Space and Non-Constant Time in Distributed Computing

Constant Space and Non-Constant Time in Distributed Computing Constant Space and Non-Constant Time in Distributed Computing Tuomo Lempiäinen and Jukka Suomela Aalto University, Finland OPODIS 20th December 2017 Lisbon, Portugal 1 / 19 Time complexity versus space

More information

Matching Partition a Linked List and Its Optimization

Matching Partition a Linked List and Its Optimization Matching Partition a Linked List and Its Otimization Yijie Han Deartment of Comuter Science University of Kentucky Lexington, KY 40506 ABSTRACT We show the curve O( n log i + log (i) n + log i) for the

More information

Parallel Integer Polynomial Multiplication Changbo Chen, Svyatoslav Parallel Integer Covanov, Polynomial FarnamMultiplication

Parallel Integer Polynomial Multiplication Changbo Chen, Svyatoslav Parallel Integer Covanov, Polynomial FarnamMultiplication Parallel Integer Polynomial Multiplication Parallel Integer Polynomial Multiplication Changbo Chen 1 Svyatoslav Covanov 2,3 Farnam Mansouri 2 Marc Moreno Maza 2 Ning Xie 2 Yuzhen Xie 2 1 Chinese Academy

More information

CprE 281: Digital Logic

CprE 281: Digital Logic CprE 281: Digital Logic Instructor: Alexander Stoytchev http://www.ece.iastate.edu/~alexs/classes/ Synchronous Sequential Circuits Basic Design Steps CprE 281: Digital Logic Iowa State University, Ames,

More information

Frequency-domain representation of discrete-time signals

Frequency-domain representation of discrete-time signals 4 Frequency-domain representation of discrete-time signals So far we have been looing at signals as a function of time or an index in time. Just lie continuous-time signals, we can view a time signal as

More information

Chapter 1. Comparison-Sorting and Selecting in. Totally Monotone Matrices. totally monotone matrices can be found in [4], [5], [9],

Chapter 1. Comparison-Sorting and Selecting in. Totally Monotone Matrices. totally monotone matrices can be found in [4], [5], [9], Chapter 1 Comparison-Sorting and Selecting in Totally Monotone Matrices Noga Alon Yossi Azar y Abstract An mn matrix A is called totally monotone if for all i 1 < i 2 and j 1 < j 2, A[i 1; j 1] > A[i 1;

More information

This is a repository copy of Improving the associative rule chaining architecture.

This is a repository copy of Improving the associative rule chaining architecture. This is a repository copy of Improving the associative rule chaining architecture. White Rose Research Online URL for this paper: http://eprints.whiterose.ac.uk/75674/ Version: Accepted Version Book Section:

More information

Notes on Computer Theory Last updated: November, Circuits

Notes on Computer Theory Last updated: November, Circuits Notes on Computer Theory Last updated: November, 2015 Circuits Notes by Jonathan Katz, lightly edited by Dov Gordon. 1 Circuits Boolean circuits offer an alternate model of computation: a non-uniform one

More information

Introduction to Quantum Computing

Introduction to Quantum Computing Introduction to Quantum Computing The lecture notes were prepared according to Peter Shor s papers Quantum Computing and Polynomial-Time Algorithms for Prime Factorization and Discrete Logarithms on a

More information

CS 310 Advanced Data Structures and Algorithms

CS 310 Advanced Data Structures and Algorithms CS 310 Advanced Data Structures and Algorithms Runtime Analysis May 31, 2017 Tong Wang UMass Boston CS 310 May 31, 2017 1 / 37 Topics Weiss chapter 5 What is algorithm analysis Big O, big, big notations

More information

A General Lower Bound on the I/O-Complexity of Comparison-based Algorithms

A General Lower Bound on the I/O-Complexity of Comparison-based Algorithms A General Lower ound on the I/O-Complexity of Comparison-based Algorithms Lars Arge Mikael Knudsen Kirsten Larsent Aarhus University, Computer Science Department Ny Munkegade, DK-8000 Aarhus C. August

More information

compare to comparison and pointer based sorting, binary trees

compare to comparison and pointer based sorting, binary trees Admin Hashing Dictionaries Model Operations. makeset, insert, delete, find keys are integers in M = {1,..., m} (so assume machine word size, or unit time, is log m) can store in array of size M using power:

More information

CME342 Parallel Methods in Numerical Analysis. Matrix Computation: Iterative Methods II. Sparse Matrix-vector Multiplication.

CME342 Parallel Methods in Numerical Analysis. Matrix Computation: Iterative Methods II. Sparse Matrix-vector Multiplication. CME342 Parallel Methods in Numerical Analysis Matrix Computation: Iterative Methods II Outline: CG & its parallelization. Sparse Matrix-vector Multiplication. 1 Basic iterative methods: Ax = b r = b Ax

More information

15.1 Proof of the Cook-Levin Theorem: SAT is NP-complete

15.1 Proof of the Cook-Levin Theorem: SAT is NP-complete CS125 Lecture 15 Fall 2016 15.1 Proof of the Cook-Levin Theorem: SAT is NP-complete Already know SAT NP, so only need to show SAT is NP-hard. Let L be any language in NP. Let M be a NTM that decides L

More information

A GENERAL POLYNOMIAL SIEVE

A GENERAL POLYNOMIAL SIEVE A GENERAL POLYNOMIAL SIEVE SHUHONG GAO AND JASON HOWELL Abstract. An important component of the index calculus methods for finding discrete logarithms is the acquisition of smooth polynomial relations.

More information

CPSC 518 Introduction to Computer Algebra Schönhage and Strassen s Algorithm for Integer Multiplication

CPSC 518 Introduction to Computer Algebra Schönhage and Strassen s Algorithm for Integer Multiplication CPSC 518 Introduction to Computer Algebra Schönhage and Strassen s Algorithm for Integer Multiplication March, 2006 1 Introduction We have now seen that the Fast Fourier Transform can be applied to perform

More information

Bayes Classifiers. CAP5610 Machine Learning Instructor: Guo-Jun QI

Bayes Classifiers. CAP5610 Machine Learning Instructor: Guo-Jun QI Bayes Classifiers CAP5610 Machine Learning Instructor: Guo-Jun QI Recap: Joint distributions Joint distribution over Input vector X = (X 1, X 2 ) X 1 =B or B (drinking beer or not) X 2 = H or H (headache

More information

Determine the size of an instance of the minimum spanning tree problem.

Determine the size of an instance of the minimum spanning tree problem. 3.1 Algorithm complexity Consider two alternative algorithms A and B for solving a given problem. Suppose A is O(n 2 ) and B is O(2 n ), where n is the size of the instance. Let n A 0 be the size of the

More information

A Lower Bound Technique for Nondeterministic Graph-Driven Read-Once-Branching Programs and its Applications

A Lower Bound Technique for Nondeterministic Graph-Driven Read-Once-Branching Programs and its Applications A Lower Bound Technique for Nondeterministic Graph-Driven Read-Once-Branching Programs and its Applications Beate Bollig and Philipp Woelfel FB Informatik, LS2, Univ. Dortmund, 44221 Dortmund, Germany

More information

Chapter 5 Divide and Conquer

Chapter 5 Divide and Conquer CMPT 705: Design and Analysis of Algorithms Spring 008 Chapter 5 Divide and Conquer Lecturer: Binay Bhattacharya Scribe: Chris Nell 5.1 Introduction Given a problem P with input size n, P (n), we define

More information

Optimizing the Dimensional Method for Performing Multidimensional, Multiprocessor, Out-of-Core FFTs

Optimizing the Dimensional Method for Performing Multidimensional, Multiprocessor, Out-of-Core FFTs Dartmouth College Computer Science Technical Report TR2001-402 Optimizing the Dimensional Method for Performing Multidimensional, Multiprocessor, Out-of-Core FFTs Jeremy T Fineman Dartmouth College Department

More information

Generalized DCell Structure for Load-Balanced Data Center Networks

Generalized DCell Structure for Load-Balanced Data Center Networks Generalized DCell Structure for Load-Balanced Data Center Networks Markus Kliegl, Jason Lee, Jun Li, Xinchao Zhang, Chuanxiong Guo, David Rincón Swarthmore College, Duke University, Fudan University, Shanghai

More information

HAMMING DISTANCE FROM IRREDUCIBLE POLYNOMIALS OVER F Introduction and Motivation

HAMMING DISTANCE FROM IRREDUCIBLE POLYNOMIALS OVER F Introduction and Motivation HAMMING DISTANCE FROM IRREDUCIBLE POLYNOMIALS OVER F 2 GILBERT LEE, FRANK RUSKEY, AND AARON WILLIAMS Abstract. We study the Hamming distance from polynomials to classes of polynomials that share certain

More information

On Pattern Matching With Swaps

On Pattern Matching With Swaps On Pattern Matching With Swaps Fouad B. Chedid Dhofar University, Salalah, Oman Notre Dame University - Louaize, Lebanon P.O.Box: 2509, Postal Code 211 Salalah, Oman Tel: +968 23237200 Fax: +968 23237720

More information

Optimal Folding Of Bit Sliced Stacks+

Optimal Folding Of Bit Sliced Stacks+ Optimal Folding Of Bit Sliced Stacks+ Doowon Paik University of Minnesota Sartaj Sahni University of Florida Abstract We develop fast polynomial time algorithms to optimally fold stacked bit sliced architectures

More information

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1 NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1 Multi-processor vs. Multi-computer architecture µp vs. DSP RISC vs. DSP RISC Reduced-instruction-set Register-to-register operation Higher throughput by using

More information

Algorithm Design and Analysis

Algorithm Design and Analysis Algorithm Design and Analysis LECTURE 14 Divide and Conquer Fast Fourier Transform Sofya Raskhodnikova 10/7/2016 S. Raskhodnikova; based on slides by K. Wayne. 5.6 Convolution and FFT Fast Fourier Transform:

More information

Remainders. We learned how to multiply and divide in elementary

Remainders. We learned how to multiply and divide in elementary Remainders We learned how to multiply and divide in elementary school. As adults we perform division mostly by pressing the key on a calculator. This key supplies the quotient. In numerical analysis and

More information