The Mailman algorithm: a note on matrix-vector multiplication
Yale University Technical Report #1402

Edo Liberty, Computer Science, Yale University, New Haven, CT
Steven W. Zucker, Computer Science and Applied Mathematics, Yale University, New Haven, CT
(Research supported by AFOSR and NGA.)

Abstract

Given an m × n matrix A, we are interested in applying it to a real vector x ∈ R^n in less than the straightforward O(mn) time. For an exact, deterministic computation, at the very least all entries of A must be accessed, requiring O(mn) operations and matching the running time of naively applying A to x. However, we claim that if the matrix contains only a constant number of distinct values, then reading the matrix once, in O(mn) steps, is sufficient to preprocess it such that any subsequent application to a vector requires only O(mn/log(max{m, n})) operations. Theoretically, our algorithm can improve on recent results for dimensionality reduction; practically, it is useful (faster) even for small matrices.

Introduction

A classical result of Winograd [1] shows that general matrix-vector multiplication requires Ω(mn) operations, which matches the running time of the naive algorithm. However, since matrix-vector multiplication is such a common, basic operation, an enormous amount of effort has been put into exploiting the special structure of matrices to accelerate it. For example, Fourier, Walsh-Hadamard, Toeplitz, Vandermonde, and other structured matrices can be applied to vectors in O(n polylog(n)) time. Others have focused on matrix-matrix multiplication. For two n × n binary matrices, the historical Four Russians algorithm [2] (modified in [3]) gives a log-factor improvement over the naive algorithm, i.e., a running time of O(n^3/log(n)). Techniques for saving log factors in real-valued matrix-matrix multiplication were also found; these include Santoro and Urrutia [4] and Savage [5]. These methods are reported to be practically more efficient than naive implementations. Classic theoretical results by Strassen [6] and Coppersmith and Winograd [7] achieve better asymptotics than those mentioned above but are slower than the naive implementation for small and moderately sized matrices. The above methods, however, do not extend to matrix-vector multiplication.

We return to matrix-vector operations. Since every entry of the matrix A must be accessed at least once, we consider a preprocessing stage which takes Ω(mn) operations. After the preprocessing stage, x is given and we seek an algorithm that produces the product Ax as fast as possible. Within this framework, Williams [8] showed that an n × n binary matrix can be preprocessed in time O(n^{2+ε}) and subsequently applied to binary vectors in O(n^2/(ε log^2 n)) time. Williams also extends his result to matrix operations over finite semirings. However, to the best of the authors' understanding, his technique cannot be extended to real vectors. In this manuscript we claim that any m × n matrix over a finite alphabet (A(i, j) ∈ Σ, where |Σ| is a constant) can be preprocessed in time O(mn) such that it can be applied to any real vector x ∈ R^n in O(mn/log(max{m, n})) operations. Such operations are common, for example, in spectral algorithms for unweighted graphs, nearest-neighbor searches, dimensionality reduction, and compressed sensing. Our algorithm also recovers all of the previous results (excluding that of Williams) while using a strikingly simple approach.

The Mailman algorithm

Intuitively, our algorithm multiplies A by x in a manner that brings to mind the way a mailman distributes letters, first sorting the letters by house and then delivering them. Recalling the identity Ax = \sum_{i=1}^{n} A^{(i)} x(i), metaphorically one can think of each column A^{(i)} as indicating one address or house, and each entry x(i) as a letter addressed to it. To continue the metaphor, imagine that computing and adding the term A^{(i)} x(i) to the sum is equivalent to the effort of walking to house A^{(i)} and delivering x(i). From this perspective, the naive algorithm delivers each letter individually to its address, regardless of how far the walk is or whether other letters are going to the same address. Actual mailmen, of course, know much better. First, they arrange their letters according to the shortest route (which includes all houses) without moving; then they walk the route, visiting each house regardless of how many letters should be delivered to it (possibly none).
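As a point of reference, the naive "one letter at a time" reading of this identity corresponds to the following column-by-column accumulation. This is a minimal C sketch with illustrative names (it is not the code used in the experiments reported below):

    #include <stddef.h>

    /* Naive delivery: add x(j) * A^(j) into y for every column j.
     * A is an m x n, row-major {0,1} matrix; O(mn) operations in total. */
    void naive_column_apply(const unsigned char *A, size_t m, size_t n,
                            const double *x, double *y)
    {
        for (size_t i = 0; i < m; i++)
            y[i] = 0.0;
        for (size_t j = 0; j < n; j++)        /* walk to house A^(j) ...      */
            for (size_t i = 0; i < m; i++)
                y[i] += A[i * n + j] * x[j];  /* ... and deliver letter x(j). */
    }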

To extend this idea to matrix-vector multiplication, our algorithm decomposes A into two matrices, P and U, such that A = UP. P is the address-letter correspondence matrix: applying P to x is analogous to arranging the letters. U is the matrix containing all possible columns of A: applying U to (Px) is analogous to walking the route. Hence, we name our algorithm after the age-old wisdom of the men and women of the postal service.

Our idea is easiest to describe for an m × n matrix A, where m = log_2(n) (w.l.o.g. assume log_2(n) is an integer) and A(i, j) ∈ {0, 1}. (Later we modify it to handle other sizes and other entry alphabets.) There are precisely 2^m = n possible columns of the matrix A, by construction, since each of the m column entries can only be 0 or 1. Now, define U_n as the matrix containing all possible columns over {0, 1} of length log_2(n). By definition, U_n contains all the columns that appear in A. More precisely, for any column A^{(j)} there exists an index 1 ≤ i ≤ n such that A^{(j)} = U_n^{(i)}. We can therefore define an n × n matrix P such that A = U_n P. In particular, P contains exactly one 1 entry per column, with P(i, j) = 1 if A^{(j)} = U_n^{(i)}.

Applying U

Denote by U_n the log_2(n) × n matrix containing all possible strings over {0, 1} of length log_2(n). Notice immediately that U_n can be constructed from U_{n/2} as follows:

    U_1 = \begin{pmatrix} 0 & 1 \end{pmatrix}, \qquad
    U_n = \begin{pmatrix} 0 \cdots 0 & 1 \cdots 1 \\ U_{n/2} & U_{n/2} \end{pmatrix}.

Applying U_n to any vector z requires fewer than 3n operations, which can be shown by dividing z into its first and second halves, z_1 and z_2, and computing the product U_n z recursively:

    U_n z = \begin{pmatrix} 0 \cdots 0 & 1 \cdots 1 \\ U_{n/2} & U_{n/2} \end{pmatrix}
            \begin{pmatrix} z_1 \\ z_2 \end{pmatrix}
          = \begin{pmatrix} 0 \cdot z_1 + 1 \cdot z_2 \\ U_{n/2}(z_1 + z_2) \end{pmatrix}.

Computing the vector z_1 + z_2 and the sums of the entries of z_1 and z_2 requires 3n/2 operations. We get the recurrence relation T(2) = 2, T(n) = T(n/2) + 3n/2, and so T(n) ≤ 3n. The sum over z_1 is of course redundant here; however, it would not have been if the entries of U were {-1, 1}, for example.
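The recursion above translates directly into code. The following is a minimal C sketch (assuming n is a power of two; the function name apply_U and the explicit scratch buffer are conventions chosen for this sketch, not prescribed by the paper):

    #include <stddef.h>

    /* Apply U_n (never stored explicitly) to z in O(n) operations.
     * out receives the log2(n) entries of U_n z; scratch must hold at
     * least n doubles of workspace. */
    void apply_U(const double *z, size_t n, double *out, double *scratch)
    {
        if (n == 2) {                    /* base case: U_1 = (0 1) */
            out[0] = z[1];
            return;
        }
        const double *z1 = z, *z2 = z + n / 2;
        double top = 0.0;
        for (size_t i = 0; i < n / 2; i++) {
            top += z2[i];                /* top row: 0*z1 + 1*z2           */
            scratch[i] = z1[i] + z2[i];  /* remaining rows: U_{n/2}(z1+z2) */
        }
        out[0] = top;
        apply_U(scratch, n / 2, out + 1, scratch + n / 2);
    }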

Constructing P

Since U is applied implicitly (it is never stored), we are only concerned with constructing P. An entry P(i, j) is set to 1 if A^{(j)} = U_n^{(i)}. Notice that, by our construction of U_n, its i-th column encodes the binary value of the index i. Thus, if a column of A contains the binary representation of the value i, it is equal to U_n^{(i)}. We therefore construct P by reading A and setting P(i, j) to 1 if A^{(j)} represents the value i. This can clearly be done in time O(mn), the size of A. Since P contains only n nonzero entries (one for each column of A), it can be applied in n steps. So, although A is of size log(n) × n, it can be applied in linear time, O(n), and a log(n) factor is gained. If the number of rows m in A is more than log(n), we can section A into ⌈m/log(n)⌉ matrices, each of size at most log(n) × n. Since each section can be applied in O(n) operations, the entire matrix can be applied in O(mn/log(n)) operations. A sketch of the resulting preprocessing and application routines appears after the remarks below.

Remarks

Remark 1 (n < m). Notice that if n < m we can section A vertically into sections of size m × log(m) and argue, very similarly, that each section, written as P^T U^T, can be applied in linear time. Thus a log(m) saving factor in the running time is achieved.

Remark 2 (Constant-sized alphabet). The definition of U_n can be changed so that it encodes all possible strings over a larger alphabet Σ with |Σ| = S:

    U_n = \begin{pmatrix} \sigma_1 \cdots \sigma_1 & \cdots & \sigma_S \cdots \sigma_S \\ U_{n/S} & \cdots & U_{n/S} \end{pmatrix}.

The matrix U_n, of size log_S(n) × n, encodes all possible strings of length log_S(n) over Σ and can be applied in linear time, O(n). If the number of rows in A is larger, we divide A into sections of size log_S(n) × n. The total running time therefore becomes O(mn log(S)/log(max{m, n})).

Remark 3 (Matrix operations over semirings). Suppose we equip Σ with an associative and commutative addition operation (+) and a multiplication operation (·) that distributes over (+). If the entries of both A and x are taken from Σ, the matrix-vector multiplication over the semiring (Σ, +, ·) can be performed using the exact same algorithm. This is, of course, not a new result; however, it includes all the other log-factor-saving results.
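Putting the pieces together, here is a minimal C sketch of the preprocessing stage and the per-vector application for a log_2(n) × n, {0, 1} matrix (the index-array representation of P, the function names, and the row-major layout are conventions chosen for this sketch; apply_U is the routine sketched in the previous section):

    #include <stdlib.h>

    /* Preprocessing, O(mn): read A once and record, for each column j,
     * the value addr[j] it encodes, i.e. the index of the column of U_n
     * equal to A^(j).  A is row-major with m = log2(n) rows; by the
     * construction of U_n, row 0 corresponds to the most significant bit. */
    void mailman_preprocess(const unsigned char *A, size_t m, size_t n,
                            size_t *addr)
    {
        for (size_t j = 0; j < n; j++) {
            size_t v = 0;
            for (size_t i = 0; i < m; i++)
                v = (v << 1) | (A[i * n + j] & 1u);
            addr[j] = v;
        }
    }

    /* Per-vector application, O(n): y = U_n (P x).  First "sort the
     * letters" (apply P by scattering x according to addr), then "walk
     * the route" (apply U_n recursively).  y receives log2(n) values.
     * Allocation checks are omitted in this sketch. */
    void mailman_apply(const size_t *addr, size_t n,
                       const double *x, double *y)
    {
        double *px = calloc(n, sizeof *px);            /* holds P x        */
        double *scratch = malloc(n * sizeof *scratch);
        for (size_t j = 0; j < n; j++)
            px[addr[j]] += x[j];                       /* one 1 per column */
        apply_U(px, n, y, scratch);
        free(px);
        free(scratch);
    }

A taller matrix is handled exactly as described above: run mailman_preprocess once per log_2(n)-row section and call mailman_apply on each section for every incoming vector.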

Dimensionality reduction

Assume we are given p points {x_1, ..., x_p} in R^n and are asked to embed them into R^m, with m ≪ n, such that all distances between points are preserved up to distortion ε. A classic result by Johnson and Lindenstrauss [9] shows that this is possible for m = Θ(log(p)/ε^2). The algorithm first chooses a random m × n matrix A from a special distribution; then each of the vectors (points) x_1, ..., x_p is multiplied by A. The embedding x_i ↦ Ax_i can be shown to have the desired property with at least constant probability. Clearly, a naive application of A to each vector requires O(mn) operations. A recent result by Ailon and Liberty [10], modifying the work of Ailon and Chazelle [11], allows A to be applied in O(n log(m)) operations.

We now claim that, in some situations, an older result by Achlioptas can be (trivially) modified to outperform this last result. In particular, Achlioptas [12] showed that the entries of A can be chosen i.i.d. from {-1, 0, 1} with certain constant probabilities. Using our method and Achlioptas's matrix, one can apply A to each vector in O(n log(p)/(log(n) ε^2)) operations (recall that m = O(log(p)/ε^2)). Therefore, if p is polynomial in n, then log(p) = O(log(n)) and applying A to x_i requires only O(n/ε^2) operations. For constant ε this outperforms the best known O(n log(m)) bound and matches the lower bound. Notice, however, that the dependence on ε is 1/ε^2 instead of log(1/ε), which makes this result much slower for most practical purposes.
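For concreteness, the sketch below shows how the routines from the previous sections could be used to project a single point: the projection matrix is sectioned into strips of L = log_2(n) rows, each strip is preprocessed once, and every point is then applied strip by strip. For simplicity it assumes a {0, 1} matrix with m an exact multiple of L; Achlioptas's {-1, 0, 1} matrix would require the ternary variant of U_n from Remark 2. All names are illustrative.

    #include <stddef.h>

    /* Embed one point x in R^n into R^m, where m = num_strips * L and
     * L = log2(n).  addr_strips[s] is the index array produced by
     * mailman_preprocess for rows [s*L, (s+1)*L) of the projection
     * matrix.  Total cost O(num_strips * n) = O(mn / log(n)). */
    void project_point(size_t **addr_strips, size_t num_strips,
                       size_t L, size_t n, const double *x, double *y)
    {
        for (size_t s = 0; s < num_strips; s++)
            mailman_apply(addr_strips[s], n, x, y + s * L);
    }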

Experiments

Here we compare the running time of applying a log(n) × n, {0, 1} matrix A to a vector of floating-point values x ∈ R^n using three methods. The first is a naive implementation in C, consisting simply of two nested loops ordered with respect to the memory layout. The second is optimized matrix-vector code: we test against LAPACK, which uses BLAS subroutines. The third is, of course, the mailman algorithm, following the preprocessing stage. Although the complexity of the first two methods is O(n log(n)) and that of the mailman algorithm is O(n), the improvement factor is well below log(n). This is because the memory access pattern in applying P is problematic at best and creates many cache misses. Bear in mind that these results may depend on the memory speed and size versus the CPU speed of the specific machine. The reported times cover the actual multiplication only, not memory allocation. (Experiments were conducted on an Intel Pentium M 1.4GHz processor with a 598MHz bus and 512MB of RAM.)

[Figure 1: Running time of three different algorithms for matrix-vector multiplication. Naive is a nested loop written in C. LAPACK is machine-optimized code. The mailman algorithm is described above. The matrices were of size m (on the x axis) by 2^m and were multiplied by real (double precision) vectors. The right panel plots the running times relative to those of the naive algorithm.]

As seen in Figure 1, the mailman algorithm runs faster than the naive algorithm even for 4 × 16 matrices. It also outperforms the LAPACK procedure for most matrix sizes. The machine-specific optimized code (LAPACK) is superior when the matrix row allocation size approaches the memory page size. A machine-specific optimized mailman algorithm might take advantage of the same phenomenon and outperform LAPACK on those values as well.

Concluding remark

It has been known for a long time that a log factor can be saved in matrix-vector multiplication when both the matrix and the vector are over constant-size alphabets. In this paper we described an algorithm that achieves this while also handling real-valued vectors. As such, the idea by itself is neither revolutionary nor complicated, but it is useful in current contexts. We showed, as a simple application, that random projections can be performed asymptotically faster than by the best currently known algorithm, provided the number of projected points is polynomial in their original dimension. Moreover, we saw that our algorithm is practically advantageous even for small matrices.

References

[1] Shmuel Winograd. On the number of multiplications necessary to compute certain functions. Communications on Pure and Applied Mathematics, (2):165-179, 1970.

[2] V.L. Arlazarov, E.A. Dinic, M.A. Kronrod, and I.A. Faradzev. On economical construction of the transitive closure of a directed graph. Soviet Mathematics Doklady, (11):1209-1210, 1970.

[3] N. Santoro and J. Urrutia. An improved algorithm for Boolean matrix multiplication. Computing, 36(4):375-382, 1986.

[4] Nicola Santoro. Extending the Four Russians bound to general matrix multiplication. Information Processing Letters, 10(2):87-88, 1980.

[5] John E. Savage. An algorithm for the computation of linear forms. SIAM Journal on Computing, 3(2):150-158, 1974.

[6] Volker Strassen. Gaussian elimination is not optimal. Numerische Mathematik, (4):354-356, 1969.

[7] D. Coppersmith and S. Winograd. Matrix multiplication via arithmetic progressions. In STOC '87: Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing, pages 1-6, New York, NY, USA, 1987. ACM.

[8] Ryan Williams. Matrix-vector multiplication in sub-quadratic time: (some preprocessing required). In SODA '07: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 995-1001, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.

[9] W.B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189-206, 1984.

[10] Nir Ailon and Edo Liberty. Fast dimension reduction using Rademacher series on dual BCH codes. In Symposium on Discrete Algorithms (SODA), 2008.

[11] Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In Proceedings of the 38th Annual ACM Symposium on Theory of Computing (STOC), pages 557-563, Seattle, WA, 2006.

[12] Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671-687, 2003.