Proofs Regarding Specification of Spark Aggregation

Proofs Regarding Specification of Spark Aggregation Yu-Fang Chen 1, Chih-Duo Hong 1, Ondřej Lengál 1,2, Shin-Cheng Mu 1, Nishant Sinha 3, and Bow-Yaw Wang 1 1 Academia Sinica, Taiwan 2 Brno University of Technology, Czech Republic 3 IBM Research, India Laws regarding monads Recall the monad laws f =<< return x = f x, (1) return =<< m = m, (2) f =<< (g =<< m) = (λx f =<< g x) =<< m. (3) Monadic application and composition are defined as: (<=<) :: (b m c) (a m b) a m c (f <=< g) x = f =<< g x ( $ ) :: (a b) m a m b f $ m = (return f ) =<< m ( ) :: (b c) (a m b) (a m c) f g = ((return f )=<<) g Laws concerning them include: (f g) x = f $ g x, (4) f =<< (g $ m) = (f g) =<< m, (5) f $ (g $ m) = (f g) $ m, (6) f (g m) = (f g) m, (7) f <=< (g h) = (f g) <=< h, (8) f (g <=< h) = (f g) <=< h. (9) Proofs of the properties above will be given later, so as not to distract us from the main theorems. 1

Regarding monad-plus, we want ([]) to be associative, with mzero as identity. Monadic bind distributes into ([]) from the end: f =<< (m [] n) = (f =<< m) [] (f =<< n). (10) It is less mentioned, but not uncommon, to demand that ([]) is also commutative and idempotent. 1 Folding and Shuffling Lemma 1. Given ( ) :: a b b, we have foldr ( ) z insert x = return foldr ( ) z (x:), provided that x (y z) = y (x z) for all x,y :: a and z :: b. Proof. Prove foldr ( ) z $ insert x xs = return (foldr ( ) z (x : xs)). Induction on xs. Case xs := [ ]. foldr ( ) z $ insert x [ ] = { definition of ( $ ) } (return foldr ( ) z) =<< insert x [ ] = { definition of insert } (return foldr ( ) z) =<< return [x] = { monadic law (1) } return (foldr ( ) z [x]). Case xs := y : xs. foldr ( ) z $ insert x (y : xs) = { definition of ( $ ) } (return foldr ( ) z) =<< insert x (y : xs) = { definition of insert } (return foldr ( ) z) =<< (return (x : y : xs) [] ((y:) $ insert x xs)) = { by (10) } return (foldr ( ) z (x : y : xs)) [] ((return foldr ( ) z (y:)) =<< insert x xs). Focus on the second branch: (return foldr ( ) z (y:)) =<< insert x xs = { definition of foldr } (return (y ) foldr ( ) z) =<< insert x xs = { by (5) } (return (y )) =<< (foldr ( ) z $ insert x xs) = { induction } 2

(return (y )) =<< return (foldr ( ) z (x : xs)) = { monadic law (1) } return (y foldr ( ) z (x : xs)) = { definition of foldr } return (y (x foldr ( ) z xs)) = { since x (y z) = y (x z) } return (foldr ( ) z (x : y : xs)). Thus we have (foldr ( ) z insert x) (y : xs) = { calculatiion above } return (foldr ( ) z (x : y : xs)) [] return (foldr ( ) z (x : y : xs)) = { idempotence of ([]) } return (foldr ( ) z (x : y : xs)). Lemma 2. Given ( ) :: a b b, we have foldr ( ) z shuffle! = return foldr ( ) z, provided that x (y z) = y (x z) for all x,y :: a and z :: b. Proof. Prove that foldr ( ) z $ shuffle xs = return (foldr ( ) z xs). Induction on xs. Case xs := [ ] foldr ( ) z $ shuffle! [ ] = { definitions of ( $ ) and shuffle! } (return foldr ( ) z) =<< return [ ] = { monadic law (1) } return (foldr ( ) z [ ]). Case xs := x : xs. foldr ( ) z $ shuffle! (x : xs) = { definition of shuffle! } foldr ( ) z $ (insert x =<< shuffle xs) = { monadic law (3) } (λxs foldr ( ) z $ insert x xs) =<< shuffle! xs = { Lemma 1 } (λxs return (foldr ( ) z (x : xs))) =<< shuffle! xs = { definition of foldr } (return (x ) foldr ( ) z) =<< shuffle xs = { by (5) } (return (x )) =<< (foldr ( ) z $ shuffle xs) = { induction } (return (x )) =<< (return (foldr ( ) z xs)) 3

= { monadic law (1) } return (x foldr ( ) z xs) = { definition of foldr } return (foldr ( ) z (x : xs)). 2 Map, Filter, and Shuffling Lemma 3. insert (f x) map f = map f insert x. Proof. Prove that map f $ insert x xs = insert (f x) (map f xs) for all xs, by induction on xs. Case xs := y : xs. map f $ insert x (y : xs) = { definition of insert } map f $ (return (x : y : xs) [] ((y:) $ insert x xs)) = { (10), definition of ( $ ) } (map f $ return (x : y : xs)) [] (map f $ ((y:) $ insert x xs)). The first branch, by definition of ( $ ) and monadic law (1), simplifies to return (map f (x: y : xs)). The second branch: map f $ ((y:) $ insert x xs) = { (6) } (map f (y:)) $ insert x xs = { definition of map } ((f y:) map f ) $ insert x xs = { (6) } (f y:) $ (map f $ insert x xs) = { induction } (f y:) $ (insert (f x) (map f xs)). Thus we have map f $ insert x (y : xs) = { calculation above } return (f x : f y : map f xs) [] ((f y:) $ (insert (f x) (map f xs))) = { definition of insert } insert (f x) (f y : map f xs) = { definition of map } insert (f x) (map f (y : xs)). Lemma 4. shuffle! map f = map f shuffle!. Lemma 5. shuffle! filter p = filter p shuffle!. 4

3 Homomorphism, etc Lemma 6. h = hom ( ) z if and only if foldr ( ) z map h = h concat. Proof. A Ping-pong proof. Direction ( ). Prove foldr ( ) z (map h xss) = h (concat xss) by induction on xss. Case xss := [ ]: foldr ( ) z (map h [ ]) = foldr ( ) z [ ] = z = h (concat [ ]). Case xss := xs : xss: foldr ( ) z (map h (xs : xss)) = h xs foldr ( ) z (map h xss) = { induction } h xs h (concat xss) = { h homomorphism } h (concat (xs : xss)). Direction ( ). Show that h satisfies the properties being a list homomorphism. On empty list: h [ ] = h (concat [ ]) = { assumption } foldr ( ) z (map h [ ]) = z. On concatentation: h (xs + ys) = h (concat [xs,ys]) = { assumption } foldr ( ) z (map h [xs,ys]) = h xs (h ys z) = h xs h ys. Lemma 7. Let ( ) :: b b b be associative on img (foldr ( ) z) with z as its identity, where ( ) :: a b b. We have foldr ( ) z = hom ( ) z if and only if x (y w) = (x y) w for all x :: a and y,w img (foldr ( ) z). Proof. A Ping-pong proof. 5

Direction ( ). We show that foldr ( ) z satisfies the homomorphic properties. It is immediate that foldr ( ) z [ ] = z. For xs + ys, note that The aim is thus to prove that foldr ( ) z (xs + ys) = foldr ( ) (foldr ( ) ys) xs. foldr ( ) (foldr ( ) ys) xs = (foldr ( ) z xs) (foldr ( ) z ys). We perform an induction on xs. The case when xs := [ ] trivially holds. For xs := x : xs, we reason: foldr ( ) (foldr ( ) ys) (x : xs) = x foldr ( ) (foldr ( ) ys) xs = { induction } x ((foldr ( ) z xs) (foldr ( ) z ys)) = { assumption: x (y w) = (x y) w } (x (foldr ( ) z xs)) (foldr ( ) z ys) = (foldr ( ) z (x : xs)) (foldr ( ) z ys). Direction ( ). Given foldr ( ) z = hom ( ) z. Let y = foldr ( ) z xs and w = foldr ( ) z ys for some xs and ys. We reason: x (y w) = x (foldr ( ) z xs foldr ( ) z ys) = { since foldr ( ) z = hom ( ) z } x (foldr ( ) z (xs + ys)) = foldr ( ) z (x : xs + ys) { since foldr ( ) z = hom ( ) z } = foldr ( ) z (x : xs) foldr ( ) z ys = (x foldr ( ) z xs) foldr ( ) z ys = (x y) w. 4 Aggregation Lemma 8. Given ( ) :: a b b and define xs w = foldr ( ) w xs. We have foldr ( ) z concat = foldr ( ) z. Proof. By foldr fusion the proof obligation is foldr ( ) z (xs + ys) = foldr ( ) (foldr ( ) z ys) xs. Induction on xs. Case xs := [ ]: 6

foldr ( ) (foldr ( ) z ys) [ ] = foldr ( ) z ys = foldr ( ) z ([ ] + ys). Case xs := x : xs: foldr ( ) (foldr ( ) z ys) (x : xs) = x foldr ( ) (foldr ( ) z ys) xs = { induction } x foldr ( ) z (xs + ys) = foldr ( ) z (x : xs + ys). Theorem 9. aggregate z ( ) ( ) = return foldr ( ) z map (foldr ( ) z), provided that ( ) is associative, commutative, and has z as identity. aggregate z ( ) ( ) = { definition of aggregate } foldr ( ) z (shuffle! map (foldr ( ) z)) = { Lemma 4 } foldr ( ) z (map (foldr ( ) z) shuffle!) = { by (7) } (foldr ( ) z map (foldr ( ) z)) shuffle! = { Lemma 2 } return foldr ( ) z map (foldr ( ) z). The last step holds because by foldr-map fusion, for all h, foldr ( ) z map h = foldr ( ) z where xs w = h xs w, and ( ) satisfies that xs (ys w) = ys (xs w) if ( ) is associative, commutative, and has z as identity. Corollary 10. aggregate z ( ) ( ) = return foldr ( ) z concat, provided that ( ) is associative, commutative, and has z as identity, and that foldr ( ) z = hom ( ) z. aggregate z ( ) ( ) = { Theorem 9 } return foldr ( ) z map (foldr ( ) z) = { Lemma 6 } return foldr ( ) z concat. 7

Lemma 11. If aggregate z ( ) ( ) = return foldr ( ) z concat, and shuffle! xss = return yss [] m, we have foldr ( ) z (concat xss) = foldr ( ) z (map (foldr ( ) z) xss) = foldr ( ) z (map (foldr ( ) z) yss). Proof. We assume the following two properties of MonadPlus: 1. m 1 [] m 2 = return x implies that m 1 = m 2 = return x. 2. return x 1 = return x 2 implies that x 1 = x 2. For our problem, if shuffle! xss = return yss [] m, we have return foldr ( ) z concat $ xss = { assumption } aggregate z ( ) ( ) $ xss = { calculation in the previous lemma } (foldr ( ) z map (foldr ( ) z)) $ shuffle! xss = { shuffle! xss = return yss [] m } (return foldr ( ) z map (foldr ( ) z) $ yss) [] ((foldr ( ) z map (foldr ( ) z)) $ m). Theorem 12. If aggregate z ( ) ( ) = return foldr ( ) z concat, we have that ( ), when restricted to values in img (foldr ( ) z), is associative, commutative, and has z as identity. Proof. In the discussion below, let x,y, and w be in img (foldr ( ) z). That is, there exists xs, ys, and ws such that x = foldr ( ) z xs, y = foldr ( ) z ys, and w = foldr ( ) z ws. Identity. We reason: y = foldr ( ) z (concat [xs]) = { shuffle! [xs] = return [xs] [] mzero, Lemma 11 } foldr ( ) z (map (foldr ( ) z) [xs]) = y z. Thus z is a right identity of ( ). y = foldr ( ) z (concat [[ ],xs]) = { shuffle! [[ ],xs] = return [[ ],xs] [] m, Lemma 11 } foldr ( ) z (map (foldr ( ) z) [[ ],xs]) = z (y z) = { z is a right identity of ( ) } z y. 8

Thus z is also a left identity of ( ). Commutativity. We reason: x y = { z is a right identity } x (y z) = foldr ( ) z (map (foldr ( ) z) [xs,ys]) = { shuffle [xs, ys] = return [ys, xs] [] m, Lemma 11 } foldr ( ) z (map (foldr ( ) z) [ys, xs]) = y (x z) y x. Associativity. We reason: x (y w) = { z is a right identity } x (y (w z)) = foldr ( ) z (map (foldr ( ) z) [xs,ys,ws]) = { } foldr ( ) z (map (foldr ( ) z) [ws,xs,ys]) = w (x (y z)) = { z is a right identity } w (x y) = { ( ) commutative } (x y) w. Theorem 13. If aggregate z ( ) ( ) = return foldr ( ) z concat, we have foldr ( ) z = hom ( ) z. Proof. Apparently foldr ( ) z [ ] = z. We are left with proving the case for concatenation. foldr ( ) z (xs + ys) = foldr ( ) z (concat [xs, ys]) = { Lemma 11 } foldr ( ) z (map (foldr ( ) z) [xs, ys]) = foldr ( ) z xs (foldr ( ) z ys z) = { Theorem 12, z is identity } foldr ( ) z xs foldr ( ) z ys. Corollary 14. Given ( ) :: a b b and ( ) :: b b b. aggregate z ( ) ( ) = return foldr ( ) z concat if and only if (img (foldr ( ) z),( ),z) forms a commutative monoid, and that foldr ( ) z = hom ( ) z. Proof. A conclusion following from Corollary 10, Theorem 12, and Theorem 13. 9

5 Tree Aggregate Lemma 15. apply! ( ) = return foldr ( ) z if ( ) is associative, commutative, and has z as identity. Proof. To do. Theorem 16. treeaggregate z ( ) ( ) = return foldr ( ) z map (foldr ( ) z) if ( ) is associative, commutative, and has z as identity. Proof. treeaggregate z ( ) ( ) = { definition of treeaggregate } apply! ( ) <=< (shuffle! map (foldr ( ) z)) = { Lemma 4 } apply! ( ) <=< (map (foldr ( ) z) shuffle!) = { by (8) } (apply! ( ) map (foldr ( ) z)) <=< shuffle! = { Lemma 15 } (return foldr ( ) z map (foldr ( ) z)) <=< shuffle! = { definitions of (<=<) and ( ) } (foldr ( ) z map (foldr ( ) z)) shuffle! = { Lemma 2 } return foldr ( ) z map (foldr ( ) z). 6 Aggregation by Key For proofs, it is helpful to have an inductive definition of ( ˆ z ). (k,v) ˆ z [ ] = [(k,v z)] (k,v) ˆ z ((j,u) : xs) k = = j = (k,v u) : xs otherwise = (j,u) : ((k,v) ˆ z xs). Define the following auxiliary functions: keq k = (k = =) fst, lookup :: Eq k k a PairRDD k a a lookup k z = hd z map snd filter (keyeq k) concat, filterkrdd :: Eq k k PairRDD k a PairRDD k a filterkrdd k = filter ( null) map (filter (keq k)). Lemma 17. For all j, k, xs, v, z, and ( ), we have: 1. filter (keq k) ((k,v) ˆ z xs) = (k,v) ˆ z filter (keq k) xs. 10

2. filter (keq k) ((j,v) ˆ z xs) = filter (keq k) xs, if k j. Lemma 18. For all k, ( ), and z, we have filter (keq k) foldr ( ˆ z ) [ ] = foldr ( ˆ z ) [ ] filter (keq k). Lemma 19. For all k, ( ), and z, we have foldr ( ˆ z ) [ ] (filter (keq k) xs) = [(k,foldr ( ) z map snd filter (keq k) $ xs)] if there exists at least one (k,v) in xs. Otherwise foldr ( ˆ z ) [ ] (filter (keq k) xs) = [ ]. Corollary 20. As a corollary, we have that for all k, ( ), and z, hd z shuffle! map value foldr ( ˆ z ) [ ] filter (keq k) = return foldr dot z map value filter (keq k). It follows from Lemma 19 because, when the input is empty or does not contain entries with key k, both sides reduce to return z. Corollary 21. For all k, z and ( ) we have: concat map (map value filter (keyeq k) foldr ( ˆ z ) [ ]) = map (foldr ( ) z) filterkrdd k. Theorem 22. If..., we have lookup k z aggregatebykey z ( ) ( ) = return foldr ( ) z map (foldr ( ) z) filterkrdd k. lookup k z aggregatebykey z ( ) ( ) = { definitions } (hd z map snd filter (keq k) concat) repartition! <=< (foldr ( ˆ z ) [ ] concat) map! (foldr ( ˆ z ) [ ]) = { concat part = return } (hd z map snd filter (keq k)) shuffle! <=< (foldr ( ˆ z ) [ ] concat) map! (foldr ( ˆ z ) [ ]) = { Lemma 4 and 5 } hd z shuffle! <=< (map snd filter (keq k) foldr ( ˆ z ) [ ] concat) map! (foldr ( ˆ z ) [ ]) = { Lemma 4 } hd z shuffle! <=< (map snd filter (keq k) foldr ( ˆ z ) [ ] concat map (foldr ( ˆ z ) [ ])) shuffle! = { by (8) and (9) } (hd z shuffle! map snd filter (keq k) foldr ( ˆ z ) [ ] concat map (foldr ( ˆ z ) [ ])) <=< shuffle!. 11

We work on the part before the right-most shuffle!: hd z shuffle! map snd filter (keq k) foldr ( ˆ z ) [ ] concat map (foldr ( ˆ z ) [ ]) = { Lemma 18 } hd z shuffle! map snd foldr ( ˆ z ) [ ] filter (keq k) concat map (foldr ( ˆ z ) [ ]) = { Corollary 20 } return foldr ( ) z map snd filter (keq k) concat map (foldr ( ˆ z ) [ ]) = { naturalty } return foldr ( ) z concat map (map snd filter (keq k) foldr ( ˆ z ) [ ]) = { Corollary 21 } return foldr ( ) z map (foldr ( ) z) filterkrdd k. Back to the main proof: lookup k z aggregatebykey z ( ) ( ) = { calculation above } (return foldr ( ) z map (foldr ( ) z) filter ( null) map (filter (keq k))) <=< shuffle! = { definitions of (<=<) and ( ) } (foldr ( ) z map (foldr ( ) z) filter ( null) map (filter (keq k))) shuffle! = { Lemma 4 and 5 } (foldr ( ) z map (foldr ( ) z)) shuffle! filterkrdd k = { Lemma 2 } return foldr ( ) z map (foldr ( ) z) filterkrdd k. lookup k z $ (aggregatemessageswithactivevertices sendmsg ( ) active (Graph vrdd erdd)) = { definitions } lookup k z $ reducebykey ( ) (map (concatmap sendifactive) erdd) = { } return foldl ( ) z concat filterkrdd k map (concatmap sendifactive) $ erdd = { naturalty laws } return foldl ( ) z concatmap sendifactive map value filter (keq k) concat $ erdd. 7 Proofs of monadic properties Proving (5) f =<< (g $ m) = (f g) =<< m. f =<< (g $ m) = { definition of ( $ ) } 12

f =<< ((return g) =<< m) = { monadic law (3) } (λx f =<< return (g x)) =<< m = { monadic law (1) } (λx f (g x)) =<< m = (f g) =<< m. Proving (6) f $ (g $ m) = (f g) $ m. f $ (g $ m) = { definition of ( $ ) } (return f ) =<< (g $ m) = { by (5) } (return f g) =<< m = { definition of ( $ ) } (f g) =<< m. For the next results we prove a lemma: (f =<<) (g=<<) = { η intro. } (λm f =<< (g =<< m)) = { monadic law (3) } (λm (λy f =<< g y) =<< m) = { η reduction } (((f =<<) g)=<<). (f =<<) (g=<<) = (((f =<<) g)=<<). (11) Proving (7) f (g m) = (f g) m. f (g m) = { definition of ( ) } ((return f )=<<) ((return g)=<<) m = { by (11) } ((((return f )=<<) return g)=<<) m = { monadic law (1) } ((return f g)=<<) m = { definition of ( ) } (f g) m. 13

Proving (8) f <=< (g h) = (f g) <=< h. f <=< (g h) = { definitions of (<=<) } (f =<<) ((return g)=<<) h = { by (11) } (((f =<<) return g)=<<) h = { monadic law (1) } (((f g)=<<) h = { definition of (<=<) } (f g) <=< h. Proving (9) f (g <=< h) = (f g) <=< h. f (g <=< h) = { definitions of (<=<) and ( ) } ((return f )=<<) (g=<<) h = { by (11) } ((((return f )=<<) g)=<<) h = { definition of ( ) } ((f g)=<<) h = { definition of (<=<) } (f g) <=< h. 14