Verifiable Stream Computation and Arthur-Merlin Communication


Amit Chakrabarti¹, Graham Cormode², Andrew McGregor³, Justin Thaler⁴, and Suresh Venkatasubramanian⁵

¹ Department of Computer Science, Dartmouth College.
² Department of Computer Science, University of Warwick.
³ Department of Computer Science, UMass Amherst.
⁴ Yahoo Labs, New York.
⁵ School of Computing, University of Utah.

Abstract

In the setting of streaming interactive proofs (SIPs), a client (verifier) needs to compute a given function on a massive stream of data, arriving online, but is unable to store even a small fraction of the data. It outsources the processing to a third party service (prover), but is unwilling to blindly trust answers returned by this service. Thus, the service cannot simply supply the desired answer; it must convince the verifier of its correctness via a short interaction after the stream has been seen. In this work we study "barely interactive" SIPs. Specifically, we show that two or three rounds of interaction suffice to solve several query problems, including Index, Median, Nearest Neighbor Search, Pattern Matching, and Range Counting, with polylogarithmic space and communication costs. Such efficiency with O(1) rounds of interaction was thought to be impossible based on previous work. On the other hand, we initiate a formal study of the limitations of constant-round SIPs by introducing a new hierarchy of communication models called Online Interactive Proofs (OIPs). The online nature of these models is analogous to the streaming restriction placed upon the verifier in an SIP. We give upper and lower bounds that (1) characterize, up to quadratic blowups, every finite level of the OIP hierarchy in terms of other well-known communication complexity classes, (2) separate the first four levels of the hierarchy, and (3) reveal that the hierarchy collapses to the fourth level.
Our study of OIPs reveals marked contrasts and some parallels with the classic Turing Machine theory of interactive proofs, establishes limits on the power of existing techniques for developing constant-round SIPs, and provides a new characterization of (non-online) Arthur-Merlin communication in terms of an online model.

ACM Subject Classification: F.1.3 Complexity Measures and Classes

Keywords and phrases: streaming interactive proofs, Arthur-Merlin communication complexity

Digital Object Identifier: /LIPIcs.CCC.2015.p

1 Introduction

The surging popularity of commercial cloud computing services, and more generally outsourced computations, has revealed compelling new applications for the study of interactive proofs with highly restricted verifiers. Consider, e.g., a retailer (verifier) who lacks the resources to locally process a massive input (say, the set of all its transactions), but can access a powerful but untrusted cloud service provider (prover), who processes the input on the retailer's behalf. The verifier must work within the confines of the restrictive data streaming paradigm, using only a small amount of working memory. The prover must both answer queries about the input (say, "how many pairs of blue jeans have I ever sold?"), and prove that the answer is correct. We refer to this general scenario as verifiable data stream computation.

It is useful to look at this computational scenario as data stream algorithms with access to a powerful (space-unlimited) prover. As is well known, most interesting data streaming problems have no nontrivial (i.e., sublinear space) algorithms unless one allows approximation. For instance, given a stream $\sigma$ of tokens from the universe $[n] := \{1,2,\ldots,n\}$, which implicitly defines frequencies $f_j$ for each $j \in [n]$, some basic questions we can ask about $\sigma$ are the number of distinct tokens $F_0(\sigma)$, the $k$th frequency moment $F_k(\sigma) = \sum_{j=1}^{n} f_j^k$, the median of the collection of numbers in $\sigma$, and the very basic point queries where, given a specific $j \in [n]$ after $\sigma$ has been presented, we wish to know $f_j$. In each case, we would like an exact answer, not an estimate. With the trivial exception of $F_1(\sigma)$, which is just the length of $\sigma$, not one of these questions can be answered by a (possibly randomized) streaming algorithm restricted to $o(n)$ space. However, with access to a powerful prover, things improve greatly: as shown in Chakrabarti et al. [9], point queries, median, and $F_k$ (for integral $k > 0$) can be computed exactly by a verifier using $\tilde{O}(\sqrt{n})$ space, while receiving $\tilde{O}(\sqrt{n})$ bits of help from the prover. Notably, the protocol achieving this $\tilde{O}(\sqrt{n})$ cost (space plus amount of help) is non-interactive: the prover sends a single message to the verifier. Chakrabarti et al. [9] also showed that under this restriction their protocol is optimal: a cost of $\Omega(\sqrt{n})$ is required. In subsequent work, Cormode et al. [15] considered streaming interactive proofs (SIPs), where the verifier may engage in several rounds of interaction with the prover, seeking to minimize both the space used by the verifier and the total amount of communication. They gave SIPs with $2k-1$ rounds of interaction following the verifier's single pass over the input stream, achieving a cost of $\tilde{O}(n^{1/(k+1)})$ for the above problems.
This generalizes the earlier set of results [9], which gave 1-round SIPs. Moreover, it achieves $O(\mathrm{polylog}\, n)$ cost with $O(\log n/\log\log n)$ rounds of interaction. In recent work, Klauck and Prakash [24] further studied this kind of computation and generalized the 1-round lower bound, claiming that a $(2k-1)$-round SIP must cost $\Omega(n^{1/(k+1)})$, even for very basic point queries. However, we identify an implicit assumption in the Klauck-Prakash lower bound argument: it applies only to protocols in which the verifier's messages to the prover are independent of the input. This happened to hold in all previous SIPs, which are ultimately descended from the sum-check protocol of Lund et al. [27]. Furthermore, this assumption is harmless in the classical theory of interactive proofs, where public-coin protocols can simulate private-coin ones with just a polynomial blowup in cost [16]. However, these simulation results fail subtly in the streaming setting, and we show that this failure is intrinsic by giving a number of new upper bounds.

1.1 New Results: Exponentially Improved Constant-Round SIPs

We start by showing that even two-round SIPs are exponentially more powerful than previously believed, on certain problems. For now we state our results informally, using the $\tilde{O}$-notation to suppress lower order factors. We give formal theorem statements later in the paper, after all definitions are in place.

Result 1.1 (Formalized in Theorem 3.1). There is a two-round SIP with cost $\tilde{O}(\log n)$ for answering point queries on a stream over the universe $[n]$.

The SIP that achieves this upper bound is based on an abstract protocol that we call the polynomial evaluation protocol. Crucially, unlike the sum-check protocols used in previous SIPs, it involves an interaction where the verifier's message to the prover depends on part of the input; specifically, it depends on the query.
Note that two rounds of interaction is likely unavoidable in practice even if verifiability is not a concern: one round may be required for the verifier to communicate the query to the prover, with a second round required for the prover to reply. Adding a third round of interaction allows us to answer selection queries, of which an important special case is median-finding.

Result 1.2 (Formalized in Theorem 3.7). There is a three-round SIP with cost $\tilde{O}(\log n)$ for determining the exact median of a stream of numbers from $[n]$.

We can in fact answer fairly complex queries with three rounds and polylogarithmic cost. For instance, given a data set presented as a stream of points from a metric space, we can answer exact nearest neighbor queries to the data set very efficiently, even in high dimensions. This is somewhat surprising, given that even the offline version of the problem seems to exhibit a "curse of dimensionality".

Result 1.3 (Formalized in Theorem 3.4). For data sets consisting of points from $[n]^d$ under a reasonable metric, such as the Manhattan distance $\ell_1^d$ or the Euclidean distance $\ell_2^d$, there is a three-round SIP with cost $\mathrm{poly}(d, \log n)$ allowing exact nearest neighbor queries to the data set.

We also give similarly efficient two-round SIPs for other well-studied query problems, such as range counting queries (Theorem 3.6), where a stream of data points is followed by a query range and the goal is to determine the number of points in the range that appeared in the stream, and pattern matching queries (Theorem 3.8), where a streamed text is followed by a (short) query pattern. Next, we work towards a detailed understanding of the subtleties of SIPs that caused the aforementioned Klauck-Prakash lower bound [24] not to apply. Our study naturally leads into communication complexity, in particular to Arthur-Merlin communication, which we discuss next.

1.2 The Connection to Arthur-Merlin Communication

Like almost all previous lower bounds for data stream computations, prior SIP lower bounds [9, 24] use reductions from problems in communication complexity. To model the prover in an SIP, the appropriate setting is Arthur-Merlin communication, which we now introduce.
Suppose Alice holds an input $x \in X$, Bob holds $y \in Y$, and they wish to compute $f(x,y)$ for some Boolean function $f : X \times Y \to \{0,1\}$, using random coins and settling for some constant probability of error. Say this costs $R(f)$ bits of communication. Can an omniscient but untrusted Merlin, who knows $(x,y)$, convince Arthur (defined as Alice and Bob together) that $f(x,y) = 1$, keeping the overall communication within $o(R(f))$? For several interesting functions $f$ the answer is "Yes", and this is the general subject of Arthur-Merlin communication complexity, first considered in seminal work by Babai, Frankl, and Simon [5]. The one-pass streaming restriction on the verifier in an SIP is modeled by requiring that Alice not receive any communication from either Bob or Merlin. Thus the Alice-Bob communication is one-way, though Bob and Merlin may interact arbitrarily. We refer to this restricted communication setting as online Arthur-Merlin communication. It should be clear that a $k$-round SIP with cost $C$ can be simulated by an online Arthur-Merlin communication of cost $C$ where Bob and Merlin interact for $k$ rounds. Thus, lower bounds on SIPs would follow from corresponding communication lower bounds in the online Arthur-Merlin setting.

At this point let us recall that the classical Turing-Machine-based theory of interactive proofs considers two different models of interaction between prover and verifier, corresponding to the complexity classes $\mathrm{IP}_{\mathrm{TM}}$,¹ where the verifier is allowed private randomness, and $\mathrm{AM}_{\mathrm{TM}}$, where he may only use public randomness. Recall the following classic results about such interactive proofs.

Equivalence of private and public coins. Goldwasser and Sipser [16] proved that a $k$-round private coin interactive proof (à la $\mathrm{IP}_{\mathrm{TM}}$) can be simulated (with a polynomial blowup in complexity) by a $(k+2)$-round public coin one (à la $\mathrm{AM}_{\mathrm{TM}}$). Thus, in the resulting protocol, the verifier can perform his interaction with the prover before even looking at the input!
¹ Throughout this paper, we use the subscript TM to denote a Turing-machine-based complexity class, to resolve the notation clash with the analogous communication complexity classes.

Round reduction. Babai and Moran [6] proved that a $(k+1)$-round interactive proof can be simulated by a $k$-round interactive proof with a polynomial blowup in the verifier's complexity. Thus, a two-round (verifier → prover → verifier) interactive proof is just as powerful as any constant-round one.

Interestingly, as we shall show in this work, neither of these phenomena holds for the online communication complexity analogs of $\mathrm{IP}_{\mathrm{TM}}$ and $\mathrm{AM}_{\mathrm{TM}}$. (Recall that "online" means that Alice does not receive any communication from either Bob or Merlin.) This point appears to have been missed in the Klauck-Prakash proof [24], which works in a public coin setting and thus applies only to a restricted class of SIPs. The new SIPs we design in this work correspond to a private coin setting, which allows the aforementioned exponential improvements. Clearly there are nuances in online Arthur-Merlin communication complexity that do not arise in classical interactive proofs. In particular, we seek a better understanding of the precise role of rounds and of private randomness in the communication setting. This is the goal of our next batch of results.

1.3 New Results: Complexity Classes for Arthur-Merlin Communication

As noted above, we can think of $\mathrm{AM}_{\mathrm{TM}}$ as a restricted interactive proof model where the verifier must interact with the prover before looking at his input. We can then define a hierarchy of analogous communication complexity models called $\mathrm{OMA}^{[k]}$ (Online Merlin-Arthur), where Bob interacts with Merlin in $k$ rounds without looking at his input, and then Alice communicates with Bob one-way. We defer precise definitions to Section 4. The aforementioned Klauck-Prakash lower bound essentially says the following:

Proposition 1.4 (Klauck and Prakash [24]). The INDEX problem, where Alice gets $x \in \{0,1\}^n$, Bob gets $j \in [n]$, and Bob must output $x_j$ with high probability, requires $\Omega(n^{1/(k+1)})$ cost in the $\mathrm{OMA}^{[2k]}$ model.
We can also define a parallel hierarchy $\mathrm{OIP}^{[k]}$ (Online Interactive Proof) of communication analogs of $\mathrm{IP}_{\mathrm{TM}}$. We now hit another subtlety. We could require the Bob-Merlin interaction to happen before the Alice-Bob communication; this is how we shall define $\mathrm{OIP}^{[k]}$. Alternatively, we could swap the order, so that Bob's messages to Merlin could depend on Alice's input as well; we shall call the resulting (more powerful) model $\mathrm{OIP}^{[k]}_{+}$.

These communication models correspond to SIPs as follows. Every SIP designed prior to this work falls into a restricted setting where the verifier's messages are independent of the input, so it can be simulated by an $\mathrm{OMA}^{[k]}$ protocol with $k$ being the number of rounds of interaction in the SIP. The SIPs we design in this work apply to query problems with the data set appearing before the query, and our verifier messages depend only on the query. Thus our SIPs are naturally simulable by $\mathrm{OIP}^{[k]}$ protocols. Finally, a general SIP, where verifier messages can depend on the entire input stream, is simulable by an $\mathrm{OIP}^{[k]}_{+}$ protocol.

Following Babai et al. [5], given a communication model $C$, we define a corresponding complexity class, also denoted $C$, consisting of all problems that have polylogarithmic cost protocols in the model $C$. We now have three parallel hierarchies of communication complexity classes: $\mathrm{OMA}^{[k]}$, $\mathrm{OIP}^{[k]}$, and $\mathrm{OIP}^{[k]}_{+}$. For our next batch of results, we prove several inclusion and separation results relating these newly defined classes to each other and to well-studied classes from earlier work in communication complexity.

Result 1.5 (Formalized over several theorems in Section 5). The following complexity class inclusions and separations, given in Figure 1, hold.

[Figure 1: diagram of arrows among the classes $R^{[1,A]}$, $R^{[2,B]}$, $\mathrm{MA}^{[2,B]}$, AM, $\mathrm{OMA}^{[k]}$, $\mathrm{OIP}^{[1]}$, $\mathrm{OIP}^{[2]}$, $\mathrm{OIP}^{[3]}$, $\mathrm{OIP}^{[4]}$, $\mathrm{OIP}^{[k]}$, $\mathrm{OIP}^{[1]}_{+}$, $R^{[3,A]}$, and $\mathrm{OIP}^{[2]}_{+}$; the arrows themselves are not recoverable from the extracted text.]

Figure 1 The layout of our communication complexity zoo. An arrow from $C_1$ to $C_2$ indicates that $C_1 \subseteq C_2$. If the arrow is double-headed, then the inclusion is strict. Here $k > 4$ is an arbitrary constant. The models $R^{[t,A]}$ (resp. $R^{[t,B]}$) are standard $t$-round randomized communication with Alice (resp. Bob) starting. The model $\mathrm{MA}^{[2,B]}$ consists of a message from Merlin followed by Bob → Alice → Bob communication, while AM is standard (see Section 5).

Notice that there are several two-way inclusions (i.e., equalities) amongst these communication complexity classes. It is worth noting that with one exception (namely $\mathrm{OIP}^{[1]} = \mathrm{OIP}^{[1]}_{+}$) none of these equalities is trivial. For instance, consider the switch from the model $R^{[2,B]}$ to the model $\mathrm{OIP}^{[2]}$: Bob loses the ability to send Alice a message before hearing from her, but gains access to Merlin. It is not a priori clear that this switch in models will result in a complexity class that is even comparable to $R^{[2,B]}$, and nontrivial simulation arguments (Theorems 5.3 and 5.6) are required to prove that $R^{[2,B]} = \mathrm{OIP}^{[2]}$. Many of our simulations incur some blowup in cost. All such blowups are at most quadratic, so polylogarithmic costs remain polylogarithmic.

The OMA and OIP hierarchies behave quite differently from the classical $\mathrm{AM}_{\mathrm{TM}}$ and $\mathrm{IP}_{\mathrm{TM}}$:

In contrast to the Goldwasser-Sipser private-by-public-coin theorem, the class $\mathrm{OIP}^{[4]}$ is strictly more powerful than $\mathrm{OMA}^{[k]}$ (in fact, even $\mathrm{OIP}^{[2]} \not\subseteq \mathrm{OMA}^{[k]}$), for every constant $k$.

In contrast to the Babai-Moran round reduction theorem, there are exactly four distinct levels (not two) in the $\mathrm{OIP}^{[k]}$ hierarchy, for constant $k$.

In the course of proving the separation results in Figure 1, we obtain concrete lower bounds for explicit functions that are of interest in their own right.
Let us highlight one of these.

Result 1.6 (Formalized in Theorem 5.9). The set disjointness problem DISJ, where Alice and Bob each get a subset of $[n]$ and must decide whether they are disjoint, requires $\Omega(n^{1/3})$ cost in the $\mathrm{OIP}^{[3]}$ model and thus does not belong to the class $\mathrm{OIP}^{[3]}$. This lower bound is tight up to a logarithmic factor.

This has implications for SIPs. We noted that all SIPs designed thus far (including the new ones in this work) are simulable in the weaker OIP models. By a standard reduction [4] from DISJ to the frequency moments problem $F_k$, it follows that, unlike what we achieved for point queries and median queries, based on currently known techniques we cannot get a polylogarithmic cost three-round SIP for $F_k$ ($k \neq 1$). Removing the qualifier "based on currently known techniques" above would require a similar lower bound for $\mathrm{OIP}^{[3]}_{+}$. Unfortunately, at present we are unable to prove any nontrivial lower bounds on $\mathrm{OIP}^{[2]}_{+}$, and doing so appears to be a rather difficult problem. Indeed, this inability is what led us to study the weaker OIP model. Yet, because the OIP models are online, the separation results in Figure 1 still morally capture the primary way in which SIPs, due to their streaming/online nature, differ from classical interactive proofs.

Finally, our result $\mathrm{AM} = \mathrm{OIP}^{[4]}$ gives a novel characterization of AM in terms of online communication. This is surprising because online models, where no one talks to Alice, might be expected to be too weak to capture AM. Proving lower bounds on AM is a longstanding and notoriously hard problem in communication complexity [22, 23, 26]. We believe our new characterization of AM is of independent interest, and may prove useful in establishing non-trivial AM lower bounds.

1.4 Related Work

Stream Computation

Early theoretical work on verifiable stream computation focused on non-interactive protocols, as in the annotated data streams model of Chakrabarti et al. [9]. In our language, that model corresponds to 1-round SIPs. Work in this model has established optimal protocols for several problems including frequency moments and frequent items [9]; linear algebraic problems such as matrix rank [24]; and graph problems like shortest $s$-$t$ path [14]. Many of these protocols have subsequently been optimized for streams whose length is much smaller than the universe size [8]. More recent protocols, such as the Arthur-Merlin streaming protocols of Gur and Raz [8, 19], are "barely interactive" in the sense that the prover and the verifier may exchange a constant number of messages. Meanwhile, the fully general streaming interactive proof (SIP) model of Cormode et al. [13, 15] permits many rounds of interaction. Cormode, Thaler, and Yi [15] showed that several general $\mathrm{IP}_{\mathrm{TM}}$ protocols can be simulated in this model. These include the powerful, general-purpose protocol of Goldwasser, Kalai, and Rothblum [18]. Given any problem in $\mathrm{NC}_{\mathrm{TM}}$, the resulting protocol requires only polylogarithmic space and communication while using polylogarithmic rounds of verifier-prover interaction. Refinements and implementations of these protocols [13, 35, 36] have demonstrated scalability and the practicality of this line of work.
Algebraic techniques lie at the core of almost all nontrivial protocols in the above models. Specifically, a number of 1-round SIPs are inspired by the Aaronson-Wigderson MA communication protocol for DISJ [2], which is in turn inspired by the classic sum-check protocol of Lund et al. [27]. The sum-check protocol is also the inspiration for the way that all previous multi-round SIPs make use of interaction. The aforementioned protocol of Goldwasser et al. [18] also builds upon the sum-check protocol. The algorithmic results outlined in Section 1.1 have a rather different algebraic idea at their core. They are based on the aforementioned polynomial evaluation protocol, which is obtained by adapting a result of Raz [33] about $\mathrm{IP}_{\mathrm{TM}}/\mathrm{rpoly}$ to a streaming setting; see the discussion at the start of Section 2.1.

Early work on interactive proofs studied space-bounded verifiers (see the survey by Condon [12]), but many protocols developed in this line of work require the verifier to store the input, and therefore do not address verifiable stream computation, as we do here. Goldwasser et al. [17] studied interactive proofs with verifiers in the complexity class $\mathrm{NC}^0_{\mathrm{TM}}$. Interestingly, they showed that private randomness is necessary to obtain interactive proofs with verifiers in $\mathrm{NC}^0_{\mathrm{TM}}$, unless the language in question is already in $\mathrm{NC}^0_{\mathrm{TM}}$. This is analogous to our finding that constant-round public coin SIPs (where the verifier's messages do not depend on the input) are exponentially weaker than general constant-round SIPs.

Computationally Sound Protocols

Protocols for verifiable stream computation have also been studied in the cryptography community [10, 32, 34]. These works only require soundness to hold against cheating provers that run in polynomial time. In exchange for this weaker security guarantee, these protocols can achieve properties that are impossible in the information-theoretic setting we consider. For example, they typically achieve reusability, allowing the verifier to use the same randomness to answer many queries. In contrast, our protocols only support "one-shot" queries, because they require the verifier to reveal secret randomness to the prover. Chung et al. [10] combine the GKR protocol with fully homomorphic encryption (FHE) to give reusable, non-interactive protocols of polylogarithmic cost for any problem in NC. Papamanthou et al. [32] give improved protocols for a class of low-complexity queries including point queries and range search: their protocols avoid the use of FHE, and allow the prover to answer such queries in polylogarithmic time (a similar property was achieved by Schröder and Schröder [34], but for a simpler class of queries, and with unidirectional communication from the verifier to the prover on each stream update). In contrast, prior work as well as our own requires the prover to spend time quasilinear in the size of the data stream after receiving a query, even if the answer itself can be computed in sublinear time (e.g., point queries, which can be solved with a single access to memory). We note however that our most interesting protocols, such as those for nearest neighbor search and pattern matching, are for problems that cannot be solved in sublinear time; hence, the quasilinear time required by our protocols does not affect the prover's runtime by more than logarithmic factors.

Communication Complexity

Seminal work by Babai et al. [5] introduced and studied the communication analogs of the major Turing Machine complexity classes, including P, NP, $\Sigma_2$, $\Pi_2$. They hinted at similar analogs of MA and the AM hierarchy. Lokam [26] related the task of placing problems outside of the communication class AM to notions of matrix rigidity.
He also observed that the communication complexity classes IP and AM behave similarly to their Turing Machine counterparts. In particular, noted theorems such as IP = PSPACE, Toda's Theorem, and Babai and Moran's round reduction results [6] all hold in the communication world (though not under online communication, as shown by this work). Online (also known as one-round) randomized communication complexity was introduced in the mid-1990s and considered by Ablayev [3], Kremer, Nisan, and Ron [25], and Newman and Szegedy [29]. Aaronson [1] introduced online variants of Merlin-Arthur communication, in classical and quantum flavors. Aaronson and Wigderson [2] gave an online MA communication protocol for DISJ (more generally, for INNER-PRODUCT) with cost $\tilde{O}(\sqrt{n})$; this is nearly optimal, as shown by a lower bound of Klauck [22] that applies to general MA protocols. More recently, Klauck [23] performed a careful study of AM, MA, and its quantum analogue QMA. In particular, he gave a promise problem PAPPMP separating QMA from AM; we shall eventually show that PAPPMP separates $\mathrm{OIP}^{[3]}$ from $\mathrm{OIP}^{[4]}$.

2 The SIP Model and the Polynomial Evaluation Protocol

In a data stream problem, the input $\sigma$ is a stream, or sequence, of tokens from some data universe $U$. The goal is to compute or approximate some function $g(\sigma)$, keeping space usage sublinear in the two key size parameters: (1) the length of $\sigma$, and (2) the size of the universe $U$. Practically speaking, we would also like to process each stream update (token arrival) quickly. All our data stream algorithms will be randomized, and we shall allow them to err with some small constant probability on each input stream. In the streaming interactive proofs (SIP) model, after processing $\sigma$, the algorithm (called the "verifier") may engage in $k$ rounds of interaction with an oracle (the "prover") who knows $\sigma$ and whose goal is to lead the verifier to output the correct answer $g(\sigma)$.
The verifier, being distrustful, will output $\bot$ (indicating "abort") if he suspects the prover to be cheating.

All of the SIPs in this paper will work in the turnstile streaming model, where $\sigma$ can contain deletions of tokens from $U$, in addition to insertions. In this model it is best to think of the input as being a stream of integer updates to a vector $x = (x_1,\ldots,x_n) \in \mathbb{Z}^n$. Initially $x = 0$, and an update is a tuple $(i,c) \in [n] \times \mathbb{Z}$, which has the effect of adding $c$ to the entry $x_i$. We will sometimes describe our algorithms as they apply to the vanilla streaming model, but it will be straightforward to extend them to the turnstile model.

We say that an SIP computes the function $g$ with completeness error $\varepsilon_c$ and soundness error $\varepsilon_s$ if for all inputs $x$ there exists a prover strategy that will cause the verifier to output $g(x)$ with probability at least $1 - \varepsilon_c$, and no prover strategy can cause the verifier to output a value outside $\{g(x), \bot\}$ with probability larger than $\varepsilon_s$. In designing SIPs, our goal will be to achieve $\varepsilon_c, \varepsilon_s \le 1/3$; clearly the theory remains unchanged if we replace $1/3$ by another constant in $(0,1/2)$. An SIP with $\varepsilon_c = 0$ is said to have perfect completeness. The total length of the verifier-prover interaction is the help cost. The space used by the streaming verifier is the space cost. The cost of an SIP is the sum of its help cost and its space cost. When designing SIP protocols we will also discuss the time complexities of the prover and the verifier. To keep things simple, we consider a model in which all arithmetic operations on a finite field of size $n^{O(1)}$ can be executed in unit time.

2.1 The Polynomial Evaluation Protocol

We shall present a two-round SIP for an abstract data stream problem called polynomial evaluation, where the input consists of a multivariate polynomial described implicitly, as a table of values, followed by a point at which the polynomial must be evaluated.
Without space constraints, this problem simply amounts to interpolation followed by direct evaluation, but our goal is to obtain a protocol where the verifier uses space roughly logarithmic in the size of the table of values, and is convinced by the prover about the correct answer after a similar amount of communication.

For ease of presentation, we shall first consider a concrete special setting that is important in its own right: the INDEX problem. In this problem, the input is a stream of $n$ data bits $x_1,\ldots,x_n$, followed by a query index $j \in [n]$. The goal is to output $x_j$ with error at most $1/3$. With very different motivations from ours, Raz [33] gave an interactive proof protocol placing every language in $\mathrm{IP}_{\mathrm{TM}}/\mathrm{rpoly}$, the class of languages that have interactive proofs with polynomial-time verifiers that take randomized advice, where the advice is kept secret from the prover. Our SIP for INDEX can be seen as an adaptation of Raz's interactive proof to the streaming setting.

Theorem 2.1. The INDEX problem has a two-round SIP with cost $O(\log n \log\log n)$, in which the verifier processes each stream token in $O(\log n)$ time and the prover runs in total time $O(n \log n)$.

Proof. Assume WLOG that $n = 2^b$, for some integer $b$. Identify each integer $z \in [n]$ with a Boolean vector $\mathbf{z} = (z_1,\ldots,z_b) \in \{0,1\}^b$ in some canonical way, such as by using the binary representation of $z$. We can then view the data bits as a table of values for the Boolean function $g_x : \{0,1\}^b \to \{0,1\}$ given by $g_x(\mathbf{z}) = x_z$, and thus for the multilinear $b$-variate polynomial $\tilde{g}_x(Z_1,\ldots,Z_b)$ given by

$$\tilde{g}_x(Z_1,\ldots,Z_b) = \sum_{\mathbf{z} \in \{0,1\}^b} g_x(\mathbf{z})\, \chi_{\mathbf{z}}(Z_1,\ldots,Z_b), \qquad (1)$$

where

$$\chi_{\mathbf{u}}(Z_1,\ldots,Z_b) = \prod_{i=1}^{b} \bigl( (1-u_i)(1-Z_i) + u_i Z_i \bigr) \qquad (2)$$

is the indicator function of the vector $\mathbf{u} = (u_1,\ldots,u_b)$. We shall interpret $\tilde{g}_x$ as a polynomial in $\mathbb{F}[Z_1,\ldots,Z_b]$ for a fixed large enough finite field $\mathbb{F}$. With this interpretation, $\tilde{g}_x$ is called the multilinear extension of $g_x$ to $\mathbb{F}$.
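To make Eqs. (1) and (2) concrete, the following is a minimal Python sketch of how a running sum over the stream yields the value of the multilinear extension at a fixed point. It rests on several assumptions that are not in the paper: the field is a prime field (arithmetic modulo a prime `p`), indices are 0-based rather than drawn from $[n]$, the canonical map is the least-significant-bit-first binary encoding, and the helper names `chi`, `to_bits`, and `stream_eval` are hypothetical.

```python
def chi(u, point, p):
    """Indicator polynomial chi_u of Eq. (2), evaluated at `point` over F_p."""
    acc = 1
    for u_i, z_i in zip(u, point):
        acc = acc * ((1 - u_i) * (1 - z_i) + u_i * z_i) % p
    return acc


def to_bits(z, b):
    """Assumed canonical map: 0-based index -> Boolean vector, low bit first."""
    return [(z >> i) & 1 for i in range(b)]


def stream_eval(stream, point, p):
    """Evaluate the multilinear extension of Eq. (1) at `point` in one pass.

    Only the running sum q and the point itself are kept in memory;
    each arriving bit x_z contributes x_z * chi_z(point).
    """
    b = len(point)
    q = 0
    for z, x_z in enumerate(stream):
        q = (q + x_z * chi(to_bits(z, b), point, p)) % p
    return q
```

On a Boolean point the running sum reproduces the table entry, for example `stream_eval(x, to_bits(j, b), p) == x[j]`, which is the sense in which the polynomial extends $g_x$; at non-Boolean points it returns the field value that a verifier could keep as a fingerprint of the stream.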
We define a line in $\mathbb{F}^b$ to be the range of a nonconstant affine function from $\mathbb{F}$ to $\mathbb{F}^b$. Every line contains exactly $|\mathbb{F}|$ points. Given such a line, $\ell$, we define its canonical representation to be the degree-1 polynomial $\lambda_\ell(W) \in \mathbb{F}^b[W]$ such that $\lambda_\ell(0)$ and $\lambda_\ell(1)$ are, respectively, the lexicographically first and second points in $\ell$. We define the canonical restriction of a polynomial $f(Z_1,\ldots,Z_b)$ to $\ell$ to be the univariate polynomial $f(\lambda_\ell(W)) \in \mathbb{F}[W]$, whose degree is at most the total degree of $f$. Using the above notations and conventions, our two-round SIP for INDEX works as shown in Figure 2.

Input: Stream of data bits $(x_1,\ldots,x_n)$ where $n = 2^b$, followed by index $j \in [n]$.
Goal: Prover to convince Verifier to output the correct value of $x_j$.
Shared Agreement: Finite field $\mathbb{F}$ with $3b+1 \le |\mathbb{F}| \le 6b+2$; bijective map $u \in [n] \mapsto \mathbf{u} \in \{0,1\}^b$.
Initialization: Verifier picks $\mathbf{r} \in_R \mathbb{F}^b$ uniformly at random, sets $Q \leftarrow 0$.
Stream Processing: Upon reading $x_z$, where $z \in [n]$, Verifier updates $Q \leftarrow Q + x_z\, \chi_{\mathbf{z}}(\mathbf{r})$.
Query Handling: Upon reading the index $j$, Verifier interacts with Prover as follows:
1. If $\mathbf{j} = \mathbf{r}$, Verifier outputs $Q$ as the answer. Otherwise, he sends Prover $\ell$, the unique line in $\mathbb{F}^b$ through $\mathbf{j}$ and $\mathbf{r}$.
2. Prover sends Verifier a polynomial $h(W) \in \mathbb{F}[W]$ of degree at most $b$, claiming that it is the canonical restriction of the multilinear polynomial $\tilde{g}_x(Z_1,\ldots,Z_b)$ to the line $\ell$. That is, Prover claims that $h(W) \equiv \tilde{g}_x(\lambda_\ell(W))$.
3. Let $w, t \in \mathbb{F}$ be such that $\lambda_\ell(w) = \mathbf{j}$ and $\lambda_\ell(t) = \mathbf{r}$. Verifier checks that $h(t) = Q$, aborting if not. If the check passes, Verifier outputs $h(w)$ as the answer.

Figure 2 A Two-Round Streaming Interactive Proof (SIP) Protocol for the INDEX Problem

To analyze this protocol, first note that after reading all the data bits, the verifier would have computed $Q = \tilde{g}_x(\mathbf{r})$, by Eq. (1). Now the protocol is easily seen to have perfect completeness. Since $\tilde{g}_x(Z_1,\ldots,Z_b)$ is multilinear, it follows that $\deg(\tilde{g}_x(\lambda_\ell(W))) \le b$, so the prover can always honestly choose $h(W) = \tilde{g}_x(\lambda_\ell(W))$. If he does so, then we will indeed have $h(t) = \tilde{g}_x(\lambda_\ell(t)) = \tilde{g}_x(\mathbf{r}) = Q$, and the verifier's check will pass. Finally, the verifier will output $h(w) = \tilde{g}_x(\lambda_\ell(w)) = \tilde{g}_x(\mathbf{j}) = x_j$, the correct answer to the INDEX instance. Next, we analyze soundness.
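As a concrete companion to Figure 2 and the completeness argument, before the soundness analysis continues, here is a minimal Python simulation of the protocol. It is a sketch under stated assumptions, not the authors' implementation: the field is taken to be a prime field $\mathbb{F}_p$, indices are 0-based, lexicographic order on points uses the integer order of residues, the prover's polynomial is transmitted as its values at $W = 0, 1, \ldots, b$, and every function name (`run_sip`, `honest_prover`, `shifted_prover`, and the helpers) is hypothetical. It needs Python 3.8 or later for modular inverses via `pow`.

```python
import random


def chi(u, point, p):
    """Indicator polynomial chi_u (Eq. (2)) at `point` over F_p."""
    acc = 1
    for u_i, z_i in zip(u, point):
        acc = acc * ((1 - u_i) * (1 - z_i) + u_i * z_i) % p
    return acc


def to_bits(z, b):
    """Assumed canonical map: 0-based index -> Boolean vector, low bit first."""
    return [(z >> i) & 1 for i in range(b)]


def mle(x, point, p):
    """Multilinear extension of the table x (Eq. (1)) at `point`."""
    b = len(point)
    return sum(v * chi(to_bits(z, b), point, p) for z, v in enumerate(x)) % p


def canonical_line(s, t, p):
    """Return (a, d) with lambda(W) = a + W*d, where lambda(0) and lambda(1)
    are the lexicographically first and second points of the line through
    s and t. The result depends only on the line, not on which two of its
    points were supplied."""
    pts = sorted(tuple((si + k * (ti - si)) % p for si, ti in zip(s, t))
                 for k in range(p))
    a, c = pts[0], pts[1]
    return a, tuple((ci - ai) % p for ai, ci in zip(a, c))


def on_line(a, d, w, p):
    return [(ai + w * di) % p for ai, di in zip(a, d)]


def param_of(a, d, point, p):
    """Solve lambda(w) = point for w using a coordinate where d is nonzero."""
    i = next(i for i, di in enumerate(d) if di)
    return (point[i] - a[i]) * pow(d[i], -1, p) % p


def interpolate(ys, w, p):
    """Lagrange-evaluate at w the polynomial taking values ys at 0..len(ys)-1."""
    total = 0
    for i, y in enumerate(ys):
        num = den = 1
        for k in range(len(ys)):
            if k != i:
                num = num * (w - k) % p
                den = den * (i - k) % p
        total = (total + y * num * pow(den, -1, p)) % p
    return total


def honest_prover(x, a, d, p, b):
    """Values of the canonical restriction at W = 0, ..., b."""
    return [mle(x, on_line(a, d, k, p), p) for k in range(b + 1)]


def shifted_prover(x, a, d, p, b):
    """A deliberately wrong prover: the honest polynomial plus one."""
    return [(y + 1) % p for y in honest_prover(x, a, d, p, b)]


def run_sip(x, j, p, prover, rng):
    """Run the protocol; return the verifier's output, or None for abort."""
    b = (len(x) - 1).bit_length()
    r = [rng.randrange(p) for _ in range(b)]       # secret random point
    q = 0
    for z, x_z in enumerate(x):                    # stream processing
        q = (q + x_z * chi(to_bits(z, b), r, p)) % p
    jb = to_bits(j, b)
    if jb == r:                                    # step 1, degenerate case
        return q
    a, d = canonical_line(jb, r, p)                # verifier's message
    ys = prover(x, a, d, p, b)                     # prover's reply (step 2)
    w, t = param_of(a, d, jb, p), param_of(a, d, r, p)
    if interpolate(ys, t, p) != q:                 # step 3 check
        return None
    return interpolate(ys, w, p)
```

For instance, with `x = [1, 0, 1, 1, 0, 0, 1, 0]`, `p = 13` (so that $3b+1 \le p \le 6b+2$ for $b = 3$), and `rng = random.Random(0)`, calling `run_sip(x, j, 13, honest_prover, rng)` returns `x[j]` for every `j`, matching perfect completeness. The shifted prover is caught whenever the check in step 3 is reached, since its polynomial differs from the honest one at every point; a subtler cheater would be caught only with the probability established in the soundness analysis.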
If the prover supplies a polynomial $h(W) \not\equiv g_x(\lambda_\ell(W))$, then, since both polynomials have degree at most $b$, they agree at at most $b$ points in $\mathbb{F}$. From the prover's perspective after he receives the verifier's message, $r$ is uniformly distributed in $\ell \setminus \{\mathbf{j}\}$. Thus, $\Pr_r[h(t) = Q] \le b/(|\mathbb{F}| - 1) \le 1/3$, where the last inequality uses $|\mathbb{F}| \ge 3b + 1$.

Now we consider this protocol's costs. The verifier maintains the random point $r \in \mathbb{F}^b$ and the running sum $Q \in \mathbb{F}$, using $O(b \log |\mathbb{F}|)$ space. He sends the prover $\ell$, which is specified by two elements of $\mathbb{F}^b$, and receives a degree-$b$ polynomial in $\mathbb{F}[W]$; both communications use at most $O(b \log |\mathbb{F}|)$ bits. Recalling that $|\mathbb{F}| \le 6b + 2$, we see that both space and communication costs are in $O(b \log b) = O(\log n \log\log n)$.

Finally, we consider the verifier's and prover's runtimes. The honest prover must send the univariate polynomial $g_x(\lambda_\ell(W))$. Since this polynomial has degree at most $b$, it suffices for the prover to specify the evaluations of $g_x(\lambda_\ell(W))$ at $b + 1 = O(\log n)$ points. A direct application of Eqs. (1) and (2) shows that each evaluation can be done in $O(n \log n)$ time, resulting in a total runtime of $O(n \log^2 n)$. However, using now-standard memoization techniques (see e.g. [36, Section 5.1]), it is possible for the prover to in fact perform each of these evaluations in just $O(n)$ time, resulting in a total runtime of $O(n \log n)$.

The verifier can run in $O(b) = O(\log n)$ time per stream update, as each stream update $x_z$ only requires the verifier to compute $\chi_{\mathbf{z}}(r)$, and it follows from Eq. (2) that this can be done with $O(b)$ field operations. When interacting with the prover, the verifier first needs to determine the line $\ell$ through $\mathbf{j}$ and $r$, which he can do in $O(b) = O(\log n)$ time. To process the prover's reply, he must evaluate the polynomial $h$ at the points $t$ and $w$; these evaluations can be done in $\mathrm{polylog}(n)$ time.

The above SIP protocol uses very little of the special structure of the INDEX problem. Let us abstract out its salient features, so as to handle the general problem described at the start of this section. First, note the protocol treats the data set given by $(x_1,\ldots,x_n)$ as an implicit description of the polynomial $g_x$. Second, note that our soundness analysis did not require multilinearity per se, only an upper bound on the total degree of $g_x$. Finally, note that the specific form of Eqs. (1) and (2) is not crucial either; all we used was that it allows the verifier an easy streaming computation. Thus, we obtain the following generic result.

Theorem 2.2 (Polynomial Evaluation Protocol). Suppose an input data stream implicitly describes a $v$-variate polynomial $g$ of total degree $d$ over a field $\mathbb{F}$, followed by a point $\mathbf{j} \in \mathbb{F}^v$. Suppose this implicit description allows a streaming verifier to evaluate $g$ at a random point $r \in_R \mathbb{F}^v$ using space $S$. Then the technique of the protocol in Figure 2 gives a two-round SIP for computing $g(\mathbf{j})$, with the following properties: (1) perfect completeness; (2) soundness error bounded by $d/(|\mathbb{F}| - 1)$; (3) space usage in $O(v \log |\mathbb{F}| + S)$; (4) help cost in $O((d + v) \log |\mathbb{F}|)$.

We shall refer to the abstract protocol given by Theorem 2.2 as the polynomial evaluation protocol.

3 Constant-Round SIPs for Query Problems

We shall now apply the polynomial evaluation protocol to design SIPs proving the various upper bounds outlined in Section 1.1. The first application is immediate; later applications bring in additional ideas.

3.1 Point Queries

In the POINTQUERY problem, the input is a stream in the turnstile model, updating an initially-zero vector $x \in \mathbb{Z}^n$, followed by a query $j \in [n]$. The goal is to output $x_j$.

Theorem 3.1.
Suppose the input to POINTQUERY is guaranteed to satisfy $|x_i| \le q$ at the end of the data stream, for all entries of $x$, where the bound $q$ is known a priori. Then there is a two-round SIP for POINTQUERY with space and help costs in $O(\log n \log(q + \log n))$.

Proof. Assume WLOG that $n = 2^b$ for an integer $b$, and use a bijection $u \in [n] \mapsto \mathbf{u} \in \{0,1\}^b$ as in Theorem 2.1. The vector $x$ resulting from the updates defines a multilinear polynomial $g_x(Z_1,\ldots,Z_b)$ by Eq. (1), where $g_x(\mathbf{z}) := x_z$. We can treat $g_x$ as a polynomial over any field we like, but to solve our problem, we need to tell apart the $2q + 1$ possible values taken on by the entries of $x$ (recall that $q$ is an upper bound on $|x_i|$ at the end of the stream). For this it suffices to have $\mathrm{char}(\mathbb{F}) \ge 2q + 1$.

Applying the polynomial evaluation protocol is now straightforward. The verifier starts with $r \in_R \mathbb{F}^b$ and $Q = 0$. Upon receiving an update indicating $x_i \leftarrow x_i + c$, he updates $Q \leftarrow Q + c\,\chi_{\mathbf{i}}(r)$. The other details are as in Figure 2. The space and communication costs are both in $O(b \log |\mathbb{F}|)$ as before. To ensure a soundness error of at most 1/3, we let $|\mathbb{F}| > 3b$ as before. This and the earlier condition on $\mathrm{char}(\mathbb{F})$ can both be satisfied by, e.g., taking $\mathbb{F} = \mathbb{F}_p$, for a prime $p > 3b + 2q$. This translates to cost bounds in $O(\log n \log(q + \log n))$, as claimed.
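The two-round protocol of Figure 2, with the turnstile update rule from the proof above, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: indices are 0-based rather than drawn from $[n]$; the field is $\mathbb{F}_p$ for a prime $p > 3b + 2q$; the prover evaluates the restricted polynomial naively instead of using memoization; and, in place of the canonical representation $\lambda_\ell$, the verifier hides $r$ on the line by sending the direction $s(r - \mathbf{j})$ for a uniformly random nonzero scalar $s$, so that $w = 0$ and $t = s^{-1}$. All names (`Verifier`, `honest_prover`, `point_query`, and so on) are hypothetical.

```python
import random

def bits(z, b):
    """Binary expansion of z as a list of b bits (least significant first)."""
    return [(z >> i) & 1 for i in range(b)]

def chi(z, point, p):
    """chi_z(point) = prod_i (z_i * point_i + (1 - z_i) * (1 - point_i)) mod p."""
    out = 1
    for zi, ri in zip(bits(z, len(point)), point):
        out = out * (ri if zi else 1 - ri) % p
    return out

def interpolate_at(values, t, p):
    """Evaluate at t the polynomial h of degree < len(values) with h(k) = values[k]."""
    total = 0
    for k, vk in enumerate(values):
        num = den = 1
        for m in range(len(values)):
            if m != k:
                num = num * (t - m) % p
                den = den * (k - m) % p
        total = (total + vk * num * pow(den, -1, p)) % p
    return total

class Verifier:
    def __init__(self, b, p, rng):
        self.b, self.p, self.rng = b, p, rng
        self.r = [rng.randrange(p) for _ in range(b)]  # secret random point in F^b
        self.Q = 0                                      # running value of g_x(r)

    def update(self, i, c):
        """Turnstile update x_i <- x_i + c, using O(b) field operations."""
        self.Q = (self.Q + c * chi(i, self.r, self.p)) % self.p

    def challenge(self, j):
        """Return (base, direction) for the line through j and r, or None if j = r."""
        base = bits(j, self.b)
        if base == self.r:
            return None
        s = self.rng.randrange(1, self.p)   # assumption: random rescaling hides r
        self.t = pow(s, -1, self.p)         # base + t * direction equals r
        return base, [s * (ri - ji) % self.p for ri, ji in zip(self.r, base)]

    def check(self, h_values):
        """Accept iff h(t) = Q; on acceptance output h(0), the claimed value at j."""
        if interpolate_at(h_values, self.t, self.p) != self.Q:
            return None
        return h_values[0]

def honest_prover(x, b, p, base, direction):
    """Send g_x restricted to the line, as its values at W = 0, 1, ..., b."""
    h = []
    for k in range(b + 1):
        point = [(bj + k * dj) % p for bj, dj in zip(base, direction)]
        h.append(sum(xz * chi(z, point, p) for z, xz in enumerate(x)) % p)
    return h

def decode(value, p):
    """Map a field element back to an integer in (-p/2, p/2)."""
    return value - p if value > p // 2 else value

def point_query(updates, j, b, p, rng, prover=honest_prover):
    """Run the whole protocol; returns x_j, or None if the verifier rejects."""
    x = [0] * (1 << b)                      # the prover's full copy of the data
    v = Verifier(b, p, rng)
    for i, c in updates:
        x[i] += c
        v.update(i, c)
    line = v.challenge(j)
    if line is None:
        return decode(v.Q, p)
    answer = v.check(prover(x, b, p, *line))
    return None if answer is None else decode(answer, p)
```

With $b = 3$ and $q = 5$, for instance, one could take $p = 23$ and call `point_query(updates, j, 3, 23, random.Random())`. The verifier's state is just `r` and `Q`, matching the $O(b \log |\mathbb{F}|)$ space bound, while the prover's message is $b + 1$ field elements.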

3.2 Nearest Neighbor Queries

Consider a premetric space$^2$ $(X, D)$ given by a finite ground set $X$ and distance function $D : X \times X \to \mathbb{R}_+$ satisfying $D(x,x) = 0$ for all $x \in X$. Let $B_D(z,r) = \{x \in X : D(x,z) \le r\}$ denote the corresponding ball of radius $r \in \mathbb{R}_+$ centered at $z \in X$. In the NEARESTNEIGHBOR problem, the input consists of a stream $x^{(1)},\ldots,x^{(m)}$ of $m$ points from $X$, constituting the data set, followed by a query point $z \in X$. The goal is to output $x^* = \operatorname{argmin}_{x^{(i)}} D(x^{(i)}, z)$, the nearest neighbor of $z$ in the data set.

$^2$ This very general setting, which includes metric spaces as special cases, captures several important distance functions such as the Bregman divergences from information theory and machine learning that satisfy neither symmetry nor the triangle inequality.

We shall give highly efficient SIPs for this problem that handle rather general distance functions $D$. To keep our statements of bounds simple, we shall impose the following structure on $(X, D)$. We assume that $X = [n]^d$. We think of $d$ as the dimensionality of the data, and $[n]^d$ as a very fine grid over the ambient space of possible points. For all $x, y \in [n]^d$, $D(x,y) \le 1$ is an integer multiple of a small parameter $\varepsilon \ge 1/n^d$. Overall, this amounts to assuming that our data set has polynomial spread: the ratio between the maximum and minimum distance.

We proceed to give two SIPs for NEARESTNEIGHBOR. Our basic SIP has cost roughly logarithmic in the stream length and the spread (and therefore linear in $d$ but only logarithmic in $n$). After we present it, we shall critique it and then give a more sophisticated SIP to handle its faults.

Theorem 3.2. Under the above assumptions on the premetric space $(X, D)$, the NEARESTNEIGHBOR problem has a three-round SIP with cost $O(d \log n \log(m + \log(d \log n)))$.

Proof. Let $\mathcal{B} = \{B_D(x, j\varepsilon) : x \in X,\ j \in \mathbb{Z},\ 0 \le j \le 1/\varepsilon\}$ be the set of all balls of all radii between 0 and 1 (quantized at granularity $\varepsilon$). By our assumptions on the structure of $(X, D)$, we have $|\mathcal{B}| \le n^d/\varepsilon \le n^{2d}$. The input stream $x^{(1)},\ldots,x^{(m)}$ defines a derived stream, consisting of updates to a vector $v$ indexed by the elements of $\mathcal{B}$. We shall denote by $v[\beta]$ the entry of $v$ indexed by $\beta \in \mathcal{B}$. The derived stream is defined as follows: the token $x^{(i)}$ increments $v[\beta]$ for every ball $\beta$ that contains $x^{(i)}$. The verifier runs the POINTQUERY protocol of Theorem 3.1 on this derived stream.

The verifier learns the query point $z$ at the end of the stream. The prover then supplies a point $y$ claimed to be a valid nearest neighbor (note that there may be more than one valid answer). To check this claim, it is sufficient for the verifier to check two properties: (1) that $y$ did appear in the stream, and (2) that the stream contained no point closer to $z$ than $y$. The first property holds iff $v[B_D(y, 0)] \ne 0$. The second property holds iff $v[B_D(z, D(y,z) - \varepsilon)] = 0$. Clearly, these two properties can be checked by two point queries over the derived stream. Following the protocol of Theorem 3.1, the two point queries (executed in parallel) involve two more rounds between the verifier and the prover, for an overall three-round SIP. Since the entries of $v$ never exceed $m$, each POINTQUERY protocol requires space and help costs $O(d \log n \log(m + \log(d \log n)))$.

While the protocol of Theorem 3.2 achieves very small space and help costs, the prover's and verifier's runtimes could be as high as $\Omega(n^d)$, because processing a single stream token $x^{(i)}$ may require both parties to enumerate all balls containing $x^{(i)}$. Ultimately, this inefficiency is because the protocol assumes hardly anything about the nature of the distance function $D$ and, as a result, does not get to exploit any structural information about the balls in $\mathcal{B}$.
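To make the reduction in the proof of Theorem 3.2 concrete, the following Python sketch builds the derived vector $v$ explicitly and performs the two checks by direct table lookup. It is only an illustration under assumptions that are not in the paper: points are 0-based tuples in $\{0,\ldots,n-1\}^d$, the distance is the $\ell_1$ distance scaled so that $\varepsilon = 1/(d(n-1))$, and radii are stored as integer multiples of $\varepsilon$. In the actual SIP the verifier never stores $v$; each lookup below would instead be one POINTQUERY invocation. The names `l1_steps`, `derived_vector` and `verify_neighbor` are hypothetical.

```python
from collections import Counter
from itertools import product

def l1_steps(x, y):
    """Distance in units of eps, i.e. D(x, y) = l1_steps(x, y) * eps (assumed choice of D)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def derived_vector(stream, n, d, dist=l1_steps):
    """v[(center, k)] = number of stream points in the ball of radius k * eps around center."""
    max_steps = d * (n - 1)                           # radius 1 equals this many eps steps
    v = Counter()
    for x in stream:
        for center in product(range(n), repeat=d):    # all n^d centers, hence the high runtime
            for k in range(dist(x, center), max_steps + 1):
                v[(center, k)] += 1
    return v

def verify_neighbor(v, y, z, dist=l1_steps):
    """The two checks from the proof; each lookup stands in for one point query."""
    k = dist(y, z)
    appeared = v[(y, 0)] != 0                         # (1) y occurred in the stream
    nothing_closer = k == 0 or v[(z, k - 1)] == 0     # (2) no point within D(y, z) - eps of z
    return appeared and nothing_closer
```

For example, with `stream = [(0, 0), (3, 1), (2, 2)]`, `n = 4`, `d = 2` and query `z = (3, 3)`, both `(3, 1)` and `(2, 2)` are accepted as nearest neighbors, while `(0, 0)` fails check (2) and `(3, 2)` fails check (1). The inner loop over all centers is exactly the per-token cost that the critique above attributes to the protocol.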

To rectify this, we shall make the entirely reasonable assumption that the distance function $D$ is efficiently computable in the rather mild sense that membership in a ball generated by $D$ can be decided by a short (say, polynomial-length) formula. Accordingly, we shall express our bounds in terms of a parameter that captures this notion of efficient computation.

Definition 3.3. Suppose the distance function $D$ on $X$ satisfies the assumptions for Theorem 3.2. Let $\Phi_D : \mathcal{B} \times X \to \{0,1\}$ be the ball membership function for $D$, i.e., $\Phi_D(B_D(z,r), x) = 1 \iff x \in B_D(z,r)$. Think of $\Phi_D$ as a Boolean function of $(3d \log n)$-bit inputs. We define the formula size complexity of $D$, denoted $\mathrm{fsize}(D)$, to be the length of the shortest de Morgan formula for $\Phi_D$.

Since addition and multiplication of $b$-bit integers can both be computed by Boolean circuits in depth $\log b$ (see, e.g., [31,37]), they can be computed by Boolean formulae of size $\mathrm{poly}(b)$. It follows that for many natural distance functions $D$, including the Euclidean, Hamming, $\ell_1$, and $\ell_\infty$ metrics (and in fact $\ell_p$ for all suitably small positive $p$), we have $\mathrm{fsize}(D) = \mathrm{poly}(d, \log n)$.

Theorem 3.4. Suppose the premetric space $(X, D)$ satisfies the assumptions made for Theorem 3.2. Then NEARESTNEIGHBOR on $(X, D)$ has a three-round SIP, whose space and help costs are both at most $O(\mathrm{fsize}(D) \log(m + \mathrm{fsize}(D)))$, in which the verifier processes each stream update in time $O(\mathrm{fsize}(D))$, and the prover runs in total time $m \cdot \mathrm{poly}(\mathrm{fsize}(D))$. In particular, if $\mathrm{fsize}(D) = \mathrm{poly}(d, \log n)$, as is the case for many natural distance functions $D$, then the space and help costs are both $\mathrm{poly}(d, \log m, \log n)$, the verifier runs in time $\mathrm{poly}(d, \log n)$ per stream update, and the prover runs in total time $m \cdot \mathrm{poly}(d, \log n)$.

We defer a proof of Theorem 3.4 to the full version of the paper, but the high level idea that allows us to avoid the high runtimes of the previous protocol is as follows.
Essentially, the SIP of Theorem 3.2 ran our polynomial evaluation protocol on a multilinear extension of the vector $v$ defined by the derived stream. That SIP took $v$ to be a completely arbitrary table of values. As a result, the verifier's computation evaluating the multilinear extension at a random point became costly. The honest prover incurred similar costs. A closer examination of the nature of $v$ reveals that if $D$ is a reasonable distance function, then $v$ itself has plenty of structure. In particular, an appropriate higher degree extension of $v$ can in fact be evaluated much more efficiently (by both the verifier and the prover) than the above multilinear extension.

3.3 Range Counting Queries

Let $\mathcal{U}$ be any data universe and $\mathcal{R} \subseteq 2^{\mathcal{U}}$ a set of ranges. In the RANGECOUNT problem, the data stream $\sigma = \langle x^{(1)},\ldots,x^{(m)}, R^* \rangle$ specifies a sequence of universe elements $x^{(i)} \in \mathcal{U}$, followed by a query or target range $R^* \in \mathcal{R}$. The goal is to output $|\{i : x^{(i)} \in R^*\}|$, i.e., the number of elements in the target range that appeared in the stream.

We easily obtain a two-round streaming interactive proof for the RANGECOUNT problem with cost bounded by $O(\log |\mathcal{R}| \log(|\mathcal{R}|\, m))$. The verifier simply runs a POINTQUERY on the derived stream $\sigma'$ defined to have data universe $\mathcal{R}$. The stream $\sigma'$ is obtained from $\sigma$ as follows: on each stream update $x^{(i)} \in \mathcal{U}$, the verifier inserts into $\sigma'$ one copy of each range $R \in \mathcal{R}$ such that $x^{(i)} \in R$. The range count problem is equivalent to a POINTQUERY on $\sigma'$, with the target item being $R^*$, and we obtain the following theorem.

Theorem 3.5. There is a two-round SIP with $O(\log |\mathcal{R}| \log(|\mathcal{R}|\, m))$ cost for RANGECOUNT. In particular, for spaces of bounded shatter dimension $\rho$, $\log |\mathcal{R}| = \rho \log m = O(\log m)$.

The above protocol also implies a three-round SIP for the problem of linear classification, a core problem in machine learning. Just as the protocol for NEARESTNEIGHBOR invokes a two-round protocol for INDEX, an SIP for linear classification (find a hyperplane that separates red and blue points) uses the above two-round RANGECOUNT protocol to verify that the proposed hyperplane has no red points on one side and no blue points on the other.

The prover and verifier in the protocol of Theorem 3.5 may require time $\Omega(|\mathcal{R}|)$ per stream update. This could be prohibitively large. However, we can obtain savings analogous to Theorem 3.4 if we make a mild efficient computability assumption on our ranges. Specifically, suppose there exists a ($\mathrm{poly}(S)$-time uniform) de Morgan formula $\Phi$ of length $S$ that takes as input a binary string representing a point $x^{(i)} \in \mathcal{U}$, as well as the label of a range $R \in \mathcal{R}$, and outputs a bit that is 1 if and only if $x^{(i)} \in R$. We then obtain the following more practical SIP.

Theorem 3.6. Suppose membership in ranges from $\mathcal{R}$ can be decided by de Morgan formulas of length $S$ as above. Then there is a two-round SIP for RANGECOUNT on $\mathcal{R}$, with costs at most $O(S \log(m + S))$, in which the verifier runs in time $O(S)$ per stream update, and the prover runs in total time $m \cdot \mathrm{poly}(S)$.

3.4 Median and Selection Queries

We give a three-round SIP for SELECTION, of which MEDIAN is a special case. In the SELECTION problem, defined over data universe $\mathcal{U} = [n]$, the data stream $\sigma = \langle x^{(1)},\ldots,x^{(m)}, \rho \rangle$ is a sequence of elements from $[n]$, followed by a desired rank $\rho \in [m]$. For $i \in [n]$, let $f_i := |\{j : x^{(j)} = i\}|$ denote the number of times element $i$ appears in the stream. Given a desired rank $\rho \in [m]$, the goal is to output an element $j \in [n]$ such that

$$\sum_{k < j} f_k < \rho \quad \text{and} \quad \sum_{k > j} f_k \le m - \rho. \tag{3}$$

MEDIAN is the special case of SELECTION when $\rho = m/2$. Our three-round SIPs for SELECTION essentially work by reducing to the RANGECOUNT problem, but an extra round is required for the prover to send the desired element $j$ to the verifier.

Theorem 3.7.
There is a three-round SIP for SELECTION with cost at most $O(\log n \log(m + \log n))$ in which the verifier runs in time $\mathrm{poly}(\log n, \log m)$ per update, and the prover runs in total time $m \cdot \mathrm{poly}(\log n, \log m)$.

The proof of Theorem 3.7 is deferred to the full version of the paper.

3.5 Pattern Matching Queries

In the pattern matching with wildcards problem, denoted PMW, we are given a stream $\sigma$ representing text $T = (t_1,\ldots,t_m) \in \{0,1,*\}^m$ followed by a pattern $P = (p_1,\ldots,p_q) \in \{0,1,*\}^q$. The wildcard symbol $*$ is interpreted as "don't care", and the pattern $P$ is said to occur at location $i$ in $T$ if, for every position $j$ in $P$, either $p_j = t_{i+j}$ or at least one of $p_j$ and $t_{i+j}$ is the wildcard symbol. The PMW problem is to determine the number of locations at which $P$ occurs in $T$. PATTERNMATCHING refers to the special case where "don't care" symbols are not permitted. We focus on a binary alphabet; a larger alphabet $\mathcal{U}$ can be handled by replacing each character in $\mathcal{U}$ with its binary representation, growing the parameter $q$ by a factor of $\log |\mathcal{U}|$.

Pattern matching, both with and without wildcards, has been extensively studied within the algorithmic literature, with applications ranging from internet search to computational genetics (see e.g. [11, 20] and the references therein). Verifiable protocols for pattern matching enable searching in the cloud, and complement work on searching in encrypted data within the cloud (e.g. [7]). Cormode et al. [13] described and implemented an SIP for PMW that required roughly $\Theta(\log^2 m)$ rounds and had space and help costs bounded by $\Theta(\log^2 m)$; concretely, their implementation required well over 1,000 rounds, even for quite small streams (of length $2^{17}$). In stark contrast, our new protocol requires the optimal number of rounds: two.

Theorem 3.8. There is a 2-round SIP for PMW with space and help costs at most $O(q \log(q + m))$, in which the verifier runs in time $O(q)$ per stream update, and the prover runs in total time $m \cdot \mathrm{poly}(q)$.

The proof of Theorem 3.8 is deferred to the full version of the paper. We remark that the PMW protocol of Theorem 3.8 can be run even if the verifier only knows an upper bound on the length $q$ of the pattern. This is because, for any $q' \ge q$, a pattern $P \in \{0,1,*\}^q$ is equivalent to the pattern $P' \in \{0,1,*\}^{q'}$ obtained from $P$ by concatenating $q' - q$ wildcard symbols to $P$.

4 Communication Protocols and Complexity Classes

We now turn to the study of communication complexity classes motivated by a desire to understand streaming interactive proofs (SIPs) from a complexity-theoretic viewpoint. In this section, we lay out the necessary definitions and terminology to rigorously discuss the notions outlined in Section 1.3. In the next section we prove the many parts of Result

4.1 Definitions

Communication problems arise naturally out of data stream problems if we suppose Alice holds a prefix of the input stream, and Bob the remaining suffix. The primary goal of such reductions is to obtain space lower bounds on data stream algorithms, so we are free to split the stream at any place we like. For example, most data stream problems in Section 3 are query problems, where the input consists of a streamed data set, $S$, followed by a query, $q$, to apply to $S$. In this case, it would be natural to split the input by giving $S$ to Alice and $q$ to Bob.
Communication problems that will play an important role in this paper include the index problem $\mathrm{INDEX} : \{0,1\}^n \times [n] \to \{0,1\}$ where $[n] := \{1,\ldots,n\}$ and $\mathrm{INDEX}(x, j) = x_j$, the set-intersection and set-disjointness problems $\mathrm{INTER}, \mathrm{DISJ} : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ where $\mathrm{INTER}(x,y) = \neg\mathrm{DISJ}(x,y) = \bigvee_{i=1}^{n} (x_i \wedge y_i)$, and the median relation $\mathrm{MED} : [n]^m \times [n]^m \to [n]$, where inputs $(x,y) \in [n]^m \times [n]^m$ are interpreted as two halves of a list of numbers, and the valid output(s) corresponds to the median(s) of the combined list.

4.2 Communication Complexity Classes

All our communication models provide random coins and allow two-sided error probability up to a constant; when unspecified, this constant defaults to 1/3. Given a communication model $\mathcal{C}$, we denote the corresponding complexity measure of a problem $f$ by $\mathcal{C}(f)$. Following Babai et al. [5], we also denote by $\mathcal{C}$ the corresponding complexity class, defined as the set of all functions $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ such that $\mathcal{C}(f) = (\log n)^{O(1)}$, i.e., functions that are easy in the model $\mathcal{C}$.

We let $\mathrm{R}^{[k,A]}$ denote the model of randomized communication complexity where Alice and Bob exchange $k \ge 1$ messages in total with Alice sending the first; $\mathrm{R}^{[k,B]}$ is similar, except that Bob starts. In the MA model, the super-player Merlin, who sees all of the input, broadcasts a message at the start, following which Alice and Bob run a (two-way, arbitrary-round) randomized verification protocol. The $\mathrm{MA}^{[k,A]}$ and $\mathrm{MA}^{[k,B]}$ models are restrictions of MA where Merlin speaks only to Bob$^3$ and

$^3$ Our definition breaks symmetry between Alice and Bob because our eventual goal is to study online protocols.
