MapReduce in Spark. Krzysztof Dembczyński. Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland

Size: px

Start display at page:

Download "MapReduce in Spark. Krzysztof Dembczyński. Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland"

Myron Reed
5 years ago
Views:

1 MapReduce in Spark Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Software Development Technologies Master studies, first semester Academic year 2017/18 (winter course) 1 / 54

2 Review of the previous lectures Mining of massive datasets. Evolution of database systems. Dimensional modeling. ETL and OLAP systems: Extraction, transformation, load. ROLAP, MOLAP, HOLAP. Challenges in OLAP systems: a huge number of possible aggregations to compute. 2 / 54

3 Outline 1 Motivation 2 MapReduce 3 Spark 4 Algorithms in Map-Reduce 5 Summary 3 / 54

4 Outline 1 Motivation 2 MapReduce 3 Spark 4 Algorithms in Map-Reduce 5 Summary 4 / 54

5 Motivation Traditional DBMS vs. NoSQL 5 / 54

6 Motivation Traditional DBMS vs. NoSQL New emerging applications: search engines, social networks, online shopping, online advertising, recommender systems, etc. 5 / 54

7 Motivation Traditional DBMS vs. NoSQL New emerging applications: search engines, social networks, online shopping, online advertising, recommender systems, etc. New computational challenges: WordCount, PageRank, etc. 5 / 54

8 Motivation Traditional DBMS vs. NoSQL New emerging applications: search engines, social networks, online shopping, online advertising, recommender systems, etc. New computational challenges: WordCount, PageRank, etc. New ideas: 5 / 54

9 Motivation Traditional DBMS vs. NoSQL New emerging applications: search engines, social networks, online shopping, online advertising, recommender systems, etc. New computational challenges: WordCount, PageRank, etc. New ideas: Scaling-out instead of scaling-up 5 / 54

10 Motivation Traditional DBMS vs. NoSQL New emerging applications: search engines, social networks, online shopping, online advertising, recommender systems, etc. New computational challenges: WordCount, PageRank, etc. New ideas: Scaling-out instead of scaling-up Move-code-to-data 5 / 54

11 Outline 1 Motivation 2 MapReduce 3 Spark 4 Algorithms in Map-Reduce 5 Summary 6 / 54

12 MapReduce-based systems Accessible run on large clusters of commodity machines or on cloud computing services such as Amazon s Elastic Compute Cloud (EC2). 7 / 54

13 MapReduce-based systems Accessible run on large clusters of commodity machines or on cloud computing services such as Amazon s Elastic Compute Cloud (EC2). Robust are intended to run on commodity hardware; desinged with the assumption of frequent hardware malfunctions; they can gracefully handle most such failures. 7 / 54

14 MapReduce-based systems Accessible run on large clusters of commodity machines or on cloud computing services such as Amazon s Elastic Compute Cloud (EC2). Robust are intended to run on commodity hardware; desinged with the assumption of frequent hardware malfunctions; they can gracefully handle most such failures. Scalable scales linearly to handle larger data by adding more nodes to the cluster. 7 / 54

15 MapReduce-based systems Accessible run on large clusters of commodity machines or on cloud computing services such as Amazon s Elastic Compute Cloud (EC2). Robust are intended to run on commodity hardware; desinged with the assumption of frequent hardware malfunctions; they can gracefully handle most such failures. Scalable scales linearly to handle larger data by adding more nodes to the cluster. Simple allow users to quickly write efficient parallel code. 7 / 54

16 MapReduce: Two simple procedures Word count: A basic operation for every search engine. Matrix-vector multiplication: A fundamental step in many algorithms, for example, in PageRank. 8 / 54

17 MapReduce: Two simple procedures Word count: A basic operation for every search engine. Matrix-vector multiplication: A fundamental step in many algorithms, for example, in PageRank. How to implement these procedures for efficient execution in a distributed system? 8 / 54

18 MapReduce: Two simple procedures Word count: A basic operation for every search engine. Matrix-vector multiplication: A fundamental step in many algorithms, for example, in PageRank. How to implement these procedures for efficient execution in a distributed system? How much can we gain by such implementation? 8 / 54

19 MapReduce: Two simple procedures Word count: A basic operation for every search engine. Matrix-vector multiplication: A fundamental step in many algorithms, for example, in PageRank. How to implement these procedures for efficient execution in a distributed system? How much can we gain by such implementation? Let us focus on the word count problem... 8 / 54

20 Word count Count the number of times each word occurs in a set of documents: Do as I say, not as I do. Word Count as 2 do 2 i 2 not 1 say 1 9 / 54

21 Word count Let us write the procedure in pseudo-code for a single machine: 10 / 54

22 Word count Let us write the procedure in pseudo-code for a single machine: d e f i n e wordcount as M u l t i s e t ; f o r each document i n documentset { T = t o k e n i z e ( document ) ; } f o r each token i n T { wordcount [ token ]++; } d i s p l a y ( wordcount ) ; 10 / 54

23 Word count Let us write the procedure in pseudo-code for many machines: 11 / 54

24 Word count Let us write the procedure in pseudo-code for many machines: First step: d e f i n e wordcount as M u l t i s e t ; f o r each document i n documentsubset { T = t o k e n i z e ( document ) ; f o r each token i n T { wordcount [ token ]++; } } sendtosecondphase ( wordcount ) ; 11 / 54

25 Word count Let us write the procedure in pseudo-code for many machines: First step: d e f i n e wordcount as M u l t i s e t ; f o r each document i n documentsubset { T = t o k e n i z e ( document ) ; f o r each token i n T { wordcount [ token ]++; } } sendtosecondphase ( wordcount ) ; Second step: d e f i n e totalwordcount as M u l t i s e t ; f o r each wordcount r e c e i v e d from f i r s t P h a s e { m u l t i s e t A d d ( totalwordcount, wordcount ) ; } 11 / 54

26 Word count To make the procedure work properly across a cluster of distributed machines, we need to add a number of functionalities: Store files over many processing machines (of phase one). Write a disk-based hash table permitting processing without being limited by RAM capacity. Partition the intermediate data (that is, wordcount) from phase one. Shuffle the partitions to the appropriate machines in phase two. Ensure fault tolerance. 12 / 54

27 MapReduce MapReduce programs are executed in two main phases, called mapping and reducing: 13 / 54

28 MapReduce MapReduce programs are executed in two main phases, called mapping and reducing: Map: the map function is written to convert input elements to key-value pairs. 13 / 54

29 MapReduce MapReduce programs are executed in two main phases, called mapping and reducing: Map: the map function is written to convert input elements to key-value pairs. Reduce: the reduce function is written to take pairs consisting of a key and its list of associated values and combine those values in some way. 13 / 54

30 MapReduce The complete data flow: Input Output map (<k1, v1>) list(<k2, v2>) reduce (<k2, list(<v2>) list(<k3, v3>) 14 / 54

31 MapReduce Figure: The complete data flow 15 / 54

32 MapReduce The complete data flow: 16 / 54

33 MapReduce The complete data flow: The input is structured as a list of key-value pairs: list(<k1,v1>). 16 / 54

34 MapReduce The complete data flow: The input is structured as a list of key-value pairs: list(<k1,v1>). The list of key-value pairs is broken up and each individual key-value pair, <k1,v1>, is processed by calling the map function of the mapper (the key k1 is often ignored by the mapper). 16 / 54

35 MapReduce The complete data flow: The input is structured as a list of key-value pairs: list(<k1,v1>). The list of key-value pairs is broken up and each individual key-value pair, <k1,v1>, is processed by calling the map function of the mapper (the key k1 is often ignored by the mapper). The mapper transforms each <k1,v1> pair into a list of <k2,v2> pairs. 16 / 54

36 MapReduce The complete data flow: The input is structured as a list of key-value pairs: list(<k1,v1>). The list of key-value pairs is broken up and each individual key-value pair, <k1,v1>, is processed by calling the map function of the mapper (the key k1 is often ignored by the mapper). The mapper transforms each <k1,v1> pair into a list of <k2,v2> pairs. The key-value pairs are processed in arbitrary order. 16 / 54

37 MapReduce The complete data flow: The input is structured as a list of key-value pairs: list(<k1,v1>). The list of key-value pairs is broken up and each individual key-value pair, <k1,v1>, is processed by calling the map function of the mapper (the key k1 is often ignored by the mapper). The mapper transforms each <k1,v1> pair into a list of <k2,v2> pairs. The key-value pairs are processed in arbitrary order. The output of all the mappers are (conceptually) aggregated into one giant list of <k2,v2> pairs. All pairs sharing the same k2 are grouped together into a new aggregated key-value pair: <k2,list(v2)>. 16 / 54

38 MapReduce The complete data flow: The input is structured as a list of key-value pairs: list(<k1,v1>). The list of key-value pairs is broken up and each individual key-value pair, <k1,v1>, is processed by calling the map function of the mapper (the key k1 is often ignored by the mapper). The mapper transforms each <k1,v1> pair into a list of <k2,v2> pairs. The key-value pairs are processed in arbitrary order. The output of all the mappers are (conceptually) aggregated into one giant list of <k2,v2> pairs. All pairs sharing the same k2 are grouped together into a new aggregated key-value pair: <k2,list(v2)>. The framework asks the reducer to process each one of these aggregated key-value pairs individually. 16 / 54

39 WordCount in MapReduce Map: For a pair <k1,document> produce a sequence of pairs <token,1>, where token is a token/word found in the document. map( S t r i n g f i l e n a m e, S t r i n g document ) { L i s t <S t r i n g > T = t o k e n i z e ( document ) ; } f o r each token i n T { emit ( ( S t r i n g ) token, ( I n t e g e r ) 1) ; } 17 / 54

40 WordCount in MapReduce Reduce For a pair <word, list(1, 1,..., 1)> sum up all ones appearing in the list and return <word, sum>, where sum is the sum of ones. r e d u c e ( S t r i n g token, L i s t <I n t e g e r > v a l u e s ) { I n t e g e r sum = 0 ; f o r each v a l u e i n v a l u e s { sum = sum + v a l u e ; } } emit ( ( S t r i n g ) token, ( I n t e g e r ) sum ) ; 18 / 54

41 Combiner and partitioner Beside map and reduce there are two other important elements that can be implemented within the MapReduce framework to control the data flow. 19 / 54

42 Combiner and partitioner Beside map and reduce there are two other important elements that can be implemented within the MapReduce framework to control the data flow. Combiner perform local aggregation (the reduce step) on the map node. 19 / 54

43 Combiner and partitioner Beside map and reduce there are two other important elements that can be implemented within the MapReduce framework to control the data flow. Combiner perform local aggregation (the reduce step) on the map node. Partitioner divide the key space of the map output and assign the key-value pairs to reducers. 19 / 54

44 Outline 1 Motivation 2 MapReduce 3 Spark 4 Algorithms in Map-Reduce 5 Summary 20 / 54

45 Spark Spark is a fast and general-purpose cluster computing system. 21 / 54

46 Spark Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. 21 / 54

47 Spark Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including: 21 / 54

48 Spark Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including: Spark SQL for SQL and structured data processing, 21 / 54

49 Spark Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including: Spark SQL for SQL and structured data processing, MLlib for machine learning, 21 / 54

50 Spark Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including: Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, 21 / 54

51 Spark Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including: Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. 21 / 54

52 Spark Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including: Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. For more check 21 / 54

53 Spark Spark uses Hadoop which is a popular open-source implementation of MapReduce. 22 / 54

54 Spark Spark uses Hadoop which is a popular open-source implementation of MapReduce. Hadoop works in a master/slave architecture for both distributed storage and distributed computation. 22 / 54

55 Spark Spark uses Hadoop which is a popular open-source implementation of MapReduce. Hadoop works in a master/slave architecture for both distributed storage and distributed computation. Hadoop Distributed File System (HDFS) is responsible for distributed storage. 22 / 54

56 Installation of Spark Download Spark from 23 / 54

57 Installation of Spark Download Spark from 23 / 54

58 Installation of Spark Download Spark from Untar the spark archive: 23 / 54

59 Installation of Spark Download Spark from Untar the spark archive: tar xvfz spark bin-hadoop2.7.tar 23 / 54

60 Installation of Spark Download Spark from Untar the spark archive: tar xvfz spark bin-hadoop2.7.tar To play with Spark there is no need to install HDFS / 54

61 Installation of Spark Download Spark from Untar the spark archive: tar xvfz spark bin-hadoop2.7.tar To play with Spark there is no need to install HDFS... But, you can try to play around with HDFS. 23 / 54

62 HDFS Create new directories: 24 / 54

63 HDFS Create new directories: hdfs dfs -mkdir /user 24 / 54

64 HDFS Create new directories: hdfs dfs -mkdir /user hdfs dfs -mkdir /user/mynane 24 / 54

65 HDFS Create new directories: hdfs dfs -mkdir /user hdfs dfs -mkdir /user/mynane Copy the input files into the distributed filesystem: 24 / 54

66 HDFS Create new directories: hdfs dfs -mkdir /user hdfs dfs -mkdir /user/mynane Copy the input files into the distributed filesystem: hdfs dfs -put data.txt /user/myname/data.txt 24 / 54

67 HDFS Create new directories: hdfs dfs -mkdir /user hdfs dfs -mkdir /user/mynane Copy the input files into the distributed filesystem: hdfs dfs -put data.txt /user/myname/data.txt View the files in the distributed filesystem: 24 / 54

68 HDFS Create new directories: hdfs dfs -mkdir /user hdfs dfs -mkdir /user/mynane Copy the input files into the distributed filesystem: hdfs dfs -put data.txt /user/myname/data.txt View the files in the distributed filesystem: hdfs dfs -ls /user/myname/ 24 / 54

69 HDFS Create new directories: hdfs dfs -mkdir /user hdfs dfs -mkdir /user/mynane Copy the input files into the distributed filesystem: hdfs dfs -put data.txt /user/myname/data.txt View the files in the distributed filesystem: hdfs dfs -ls /user/myname/ hdfs dfs -cat /user/myname/data.txt 24 / 54

70 WordCount in Hadoop import ja va. i o. IOException ; import j a v a. u t i l. S t r i n g T o k e n i z e r ; import org. apache. hadoop. c o n f. C o n f i g u r a t i o n ; import org. apache. hadoop. f s. Path ; import org. apache. hadoop. i o. I n t W r i t a b l e ; import org. apache. hadoop. i o. Text ; import org. apache. hadoop. mapreduce. Job ; import org. apache. hadoop. mapreduce. Mapper ; import org. apache. hadoop. mapreduce. Reducer ; import org. apache. hadoop. mapreduce. l i b. i n p u t. F i l e I n p u t F o r m a t ; import org. apache. hadoop. mapreduce. l i b. o u tput. F i l e O u t p u t F o r m a t ; p u b l i c c l a s s WordCount { p u b l i c s t a t i c c l a s s TokenizerMapper extends Mapper<Object, Text, Text, IntWritable >{ p r i v a t e f i n a l s t a t i c I n t W r i t a b l e one = new I n t W r i t a b l e ( 1 ) ; p r i v a t e Text word = new Text ( ) ; p u b l i c v o i d map( Object key, Text value, Context context ) throws IOException, InterruptedException { S t r i n g T o k e n i z e r i t r = new S t r i n g T o k e n i z e r ( v a l u e. t o S t r i n g ( ) ) ; w h i l e ( i t r. hasmoretokens ( ) ) { word. s e t ( i t r. nexttoken ( ) ) ; c o n t e x t. w r i t e ( word, one ) ; } } } (... ) 25 / 54

71 WordCount in Hadoop (... ) p u b l i c s t a t i c c l a s s IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> { p r i v a t e IntWritable r e s u l t = new IntWritable ( ) ; p u b l i c v o i d r e d u c e ( Text key, I t e r a b l e <I n t W r i t a b l e > v a l u e s, Context c o n t e x t ) throws IOException, InterruptedException { i n t sum = 0 ; f o r ( I n t W r i t a b l e v a l : v a l u e s ) { sum += v a l. g e t ( ) ; } r e s u l t. s e t ( sum ) ; c o n t e x t. w r i t e ( key, r e s u l t ) ; } } public s t a t i c void main ( String [ ] args ) throws Exception { C o n f i g u r a t i o n conf = new C o n f i g u r a t i o n ( ) ; Job job = Job. getinstance ( conf, word count ) ; j o b. s e t J a r B y C l a s s ( WordCount. c l a s s ) ; j o b. s e t M a p p e r C l a s s ( TokenizerMapper. c l a s s ) ; j o b. s e t C o m b i n e r C l a s s ( IntSumReducer. c l a s s ) ; j o b. s e t R e d u c e r C l a s s ( IntSumReducer. c l a s s ) ; j o b. s e t O u t p u t K e y C l a s s ( Text. c l a s s ) ; j o b. s e t O u t p u t V a l u e C l a s s ( I n t W r i t a b l e. c l a s s ) ; F i l e I n p u t F o r m a t. addinputpath ( job, new Path ( a r g s [ 0 ] ) ) ; F i l e O u t p u t F o r m a t. setoutputpath ( job, new Path ( a r g s [ 1 ] ) ) ; System. e x i t ( j o b. w a i t F o r C o m p l e t i o n ( t r u e )? 0 : 1) ; } } 26 / 54

72 WordCount in Spark The same code is much simpler in Spark To run the Spark shell type:./bin/spark-shell The code v a l t e x t F i l e = s c. t e x t F i l e ( / data / a l l b i b l e. t x t ) v a l c o u n t s = ( t e x t F i l e. f l atmap ( l i n e => l i n e. s p l i t ( ) ). map( word => ( word, 1) ). reducebykey ( + ) ) c o u n t s. s a v e A s T e x t F i l e ( / data / a l l b i b l e c o u n t s. t x t ) Alternatively: v a l wordcounts = t e x t F i l e. f latmap ( l i n e => l i n e. s p l i t ( ) ). groupbykey ( i d e n t i t y ). count ( ) 27 / 54

73 Outline 1 Motivation 2 MapReduce 3 Spark 4 Algorithms in Map-Reduce 5 Summary 28 / 54

74 Algorithms in Map-Reduce How to implement fundamental algorithms in MapReduce? Matrix-vector multiplication. Relational-Algebra Operations. Matrix multiplication. 29 / 54

75 Matrix-vector Multiplication Let A to be large n m matrix, and x a long vector of size m. The matrix-vector multiplication is defined as: 30 / 54

76 Matrix-vector Multiplication Let A to be large n m matrix, and x a long vector of size m. The matrix-vector multiplication is defined as: Ax = v, where v = (v 1,..., v n ) and m v i = a ij x j. j=1 30 / 54

77 Matrix-vector multiplication Let us first assume that m is large, but not so large that vector x cannot fit in main memory, and be part of the input to every Map task. The matrix A is stored with explicit coordinates, as a triple (i, j, a ij ). We also assume the position of element x j in the vector x will be stored in the analogous way. 31 / 54

78 Matrix-vector multiplication Map: 32 / 54

79 Matrix-vector multiplication Map: each map task will take the entire vector x and a chunk of the matrix A. From each matrix element a ij it produces the key-value pair (i, a ij x j ). Thus, all terms of the sum that make up the component v i of the matrix-vector product will get the same key. 32 / 54

80 Matrix-vector multiplication Map: each map task will take the entire vector x and a chunk of the matrix A. From each matrix element a ij it produces the key-value pair (i, a ij x j ). Thus, all terms of the sum that make up the component v i of the matrix-vector product will get the same key. Reduce: 32 / 54

81 Matrix-vector multiplication Map: each map task will take the entire vector x and a chunk of the matrix A. From each matrix element a ij it produces the key-value pair (i, a ij x j ). Thus, all terms of the sum that make up the component v i of the matrix-vector product will get the same key. Reduce: a reduce task has simply to sum all the values associated with a given key i. The result will be a pair (i, v i ) where: v i = m a ij x j. j=1 32 / 54

82 Matrix-vector multiplication in Spark The Spark code is quite simple: v a l x = s c. t e x t F i l e ( / data / x. t x t ). map( l i n e => { v a l t = l i n e. s p l i t (, ) ; ( t ( 0 ). t r i m. t o I n t, t ( 1 ). t r i m. todouble ) }) v a l v e c t o r X = x. map{case ( i, v ) => v }. c o l l e c t v a l b r o a d c a s t e d X = s c. b r o a d c a s t ( v e c t o r X ) v a l m a t r i x = s c. t e x t F i l e ( / data /M. t x t ). map( l i n e => { v a l t = l i n e. s p l i t (, ) ; ( t ( 0 ). t r i m. t o I n t, t ( 1 ). t r i m. t o I n t, t ( 2 ). t r i m. todouble ) }) v a l v = m a t r i x. map { case ( i, j, a ) => ( i, a b r o a d c a s t e d X. v a l u e ( j 1)) }. reducebykey ( + ) v. todf. orderby ( 1 ). show 33 / 54

83 Matrix-Vector Multiplication with Large Vector v 34 / 54

84 Matrix-Vector Multiplication with Large Vector v Divide the matrix into vertical stripes of equal width and divide the vector into an equal number of horizontal stripes, of the same height. The ith stripe of the matrix multiplies only components from the ith stripe of the vector. Thus, we can divide the matrix into one file for each stripe, and do the same for the vector. 34 / 54

85 Matrix-Vector Multiplication with Large Vector v Each Map task is assigned a chunk from one the stripes of the matrix and gets the entire corresponding stripe of the vector. The Map and Reduce tasks can then act exactly as in the case where Map tasks get the entire vector. 35 / 54

86 Relational-Algebra Operations Example (Relation Links) From To url1 url2 url1 url3 url2 url3 url2 url / 54

87 Relational-Algebra Operations Selection Projection Union, Intersection, and Difference Natural Join Grouping and Aggregation 37 / 54

88 Relational-Algebra Operations R, S - relation t, t - a tuple C - a condition of selection A, B, C - subset of attributes a, b, c - attribute values for a given subset of attributes 38 / 54

89 Selection Map: 39 / 54

90 Selection Map: For each tuple t in R, test if it satisfies C. If so, produce the key-value pair (t, t). That is, both the key and value are t. Reduce: 39 / 54

91 Selection Map: For each tuple t in R, test if it satisfies C. If so, produce the key-value pair (t, t). That is, both the key and value are t. Reduce: The Reduce function is the identity. It simply passes each key-value pair to the output. 39 / 54

92 Projection 40 / 54

93 Projection Map: 40 / 54

94 Projection Map: For each tuple t in R, construct a tuple t by eliminating from t those components whose attributes are not in A. Output the key-value pair (t, t ). Reduce: 40 / 54

95 Projection Map: For each tuple t in R, construct a tuple t by eliminating from t those components whose attributes are not in A. Output the key-value pair (t, t ). Reduce: For each key t produced by any of the Map tasks, there will be one or more key-value pairs (t, t ). The Reduce function turns (t, [t, t,..., t ]) into (t, t ), so it produces exactly one pair (t, t ) for this key t. 40 / 54

96 Union Map: 41 / 54

97 Union Map: Turn each input tuple t either from relation R or S into a key-value pair (t, t). Reduce: 41 / 54

98 Union Map: Turn each input tuple t either from relation R or S into a key-value pair (t, t). Reduce: Associated with each key t there will be either one or two values. Produce output (t, t) in either case. 41 / 54

99 Intersection Map: 42 / 54

100 Intersection Map: Turn each input tuple t either from relation R or S into a key-value pair (t, t). Reduce: 42 / 54

101 Intersection Map: Turn each input tuple t either from relation R or S into a key-value pair (t, t). Reduce: If key t has value list [t, t], then produce (t, t). Otherwise, produce nothing. 42 / 54

102 Minus Map: 43 / 54

103 Minus Map: For a tuple t in R, produce key-value pair (t, name(r)), and for a tuple t in S, produce key-value pair (t, name(s)). Reduce: 43 / 54

104 Minus Map: For a tuple t in R, produce key-value pair (t, name(r)), and for a tuple t in S, produce key-value pair (t, name(s)). Reduce: For each key t, do the following. 1 If the associated value list is [name(r)], then produce (t, t). 2 If the associated value list is anything else, which could only be [name(r), name(s)], [name(s), name(r)], or [name(s)], produce nothing. 43 / 54

105 Natural Join Let us assume that we join relation R(A, B) with relation S(B, C) that share the same attribute B. Map: 44 / 54

106 Natural Join Let us assume that we join relation R(A, B) with relation S(B, C) that share the same attribute B. Map: For each tuple (a, b) of R, produce the key-value pair (b, (name(r), a)). For each tuple (b, c) of S, produce the key-value pair (b, (name(s), c)). Reduce: 44 / 54

107 Natural Join Let us assume that we join relation R(A, B) with relation S(B, C) that share the same attribute B. Map: For each tuple (a, b) of R, produce the key-value pair (b, (name(r), a)). For each tuple (b, c) of S, produce the key-value pair (b, (name(s), c)). Reduce: Each key value b will be associated with a list of pairs that are either of the form (name(r), a) or (name(s), c). Construct all pairs consisting of one with first component name(r) and the other with first component S, say (name(r), a) and (name(s), c). The output for key b is (b, [(a1, b, c1), (a2, b, c2),...]), that is, b associated with the list of tuples that can be formed from an R-tuple and an S-tuple with a common b value. 44 / 54

108 Grouping and Aggregation Let assume that we group a relation R(A, B, C) by attributes A and aggregate values of B. Map: 45 / 54

109 Grouping and Aggregation Let assume that we group a relation R(A, B, C) by attributes A and aggregate values of B. Map: For each tuple (a, b, c) produce the key-value pair (a, b). Reduce: 45 / 54

110 Grouping and Aggregation Let assume that we group a relation R(A, B, C) by attributes A and aggregate values of B. Map: For each tuple (a, b, c) produce the key-value pair (a, b). Reduce: Each key a represents a group. Apply the aggregation operator θ to the list [b 1, b 2,..., b n ] of B-values associated with key a. The output is the pair (a, x), where x is the result of applying θ to the list. For example, if θ is SUM, then x = b 1 + b b n, and if θ is MAX, then x is the largest of b 1, b 2,..., b n. 45 / 54

111 Matrix Multiplication If M is a matrix with element m ij in row i and column j, and N is a matrix with element n jk in row j and column k, then the product: P = MN is the matrix P with element p ik in row i and column k, where: pik = 46 / 54

112 Matrix Multiplication If M is a matrix with element m ij in row i and column j, and N is a matrix with element n jk in row j and column k, then the product: P = MN is the matrix P with element p ik in row i and column k, where: pik = j m ij n jk 46 / 54

113 Matrix Multiplication We can think of a matrix M and N as a relation with three attributes: the row number, the column number, and the value in that row and column, i.e.,: M(I, J, V ) and N(J, K, W ) with the following tuples, respectively: (i, j, m ij ) and (j, k, n jk ). In case of sparsity of M and N, this relational representation is very efficient in terms of space. The product MN is almost a natural join followed by grouping and aggregation. 47 / 54

114 Matrix Multiplication 48 / 54

115 Matrix Multiplication Map: 48 / 54

116 Matrix Multiplication Map: Send each matrix element m ij to the key value pair: (j, (M, i, m ij )). Analogously, send each matrix element n jk to the key value pair: (j, (N, k, n jk )). Reduce: 48 / 54

117 Matrix Multiplication Map: Send each matrix element m ij to the key value pair: (j, (M, i, m ij )). Analogously, send each matrix element n jk to the key value pair: (j, (N, k, n jk )). Reduce: For each key j, examine its list of associated values. For each value that comes from M, say (M, i, m ij ), and each value that comes from N, say (N, k, n jk ), produce the tuple (i, k, v = m ij n jk ), The output of the Reduce function is a key j paired with the list of all the tuples of this form that we get from j: (j, [(i 1, k 1, v 1 ), (i 2, k 2, v 2 ),..., (i p, k p, v p )]). 48 / 54

118 Matrix Multiplication 49 / 54

119 Matrix Multiplication Map: 49 / 54

120 Matrix Multiplication Map: From the pairs that are output from the previous Reduce function produce p key-value pairs: Reduce: ((i 1, k 1 ), v 1 ), ((i 2, k 2 ), v 2 ),..., ((i p, k p ), v p ). 49 / 54

121 Matrix Multiplication Map: From the pairs that are output from the previous Reduce function produce p key-value pairs: ((i 1, k 1 ), v 1 ), ((i 2, k 2 ), v 2 ),..., ((i p, k p ), v p ). Reduce: For each key (i, k), produce the sum of the list of values associated with this key. The result is a pair ((i, k), v), where v is the value of the element in row i and column k of the matrix P = MN. 49 / 54

122 Matrix Multiplication with One Map-Reduce Step Map: 50 / 54

123 Matrix Multiplication with One Map-Reduce Step Map: For each element m ij of M, produce a key-value pair ((i, k), (M, j, m ij )), for k = 1, 2,..., up to the number of columns of N. Also, for each element n jk of N, produce a key-value pair ((i, k), (N, j, n jk )), for i = 1, 2,..., up to the number of rows of M. 50 / 54

124 Matrix Multiplication with One Map-Reduce Step Reduce: 51 / 54

125 Matrix Multiplication with One Map-Reduce Step Reduce: Each key (i, k) will have an associated list with all the values (M, j, m ij ) and (N, j, n jk ), for all possible values of j. We connect the two values on the list that have the same value of j, for each j: We sort by j the values that begin with M and sort by j the values that begin with N, in separate lists, The jth values on each list must have their third components, mij and n jk extracted and multiplied, Then, these products are summed and the result is paired with (i, k) in the output of the Reduce function. 51 / 54

126 Outline 1 Motivation 2 MapReduce 3 Spark 4 Algorithms in Map-Reduce 5 Summary 52 / 54

127 Summary Computational burden data partitioning, distributed systems. New data-intensive challenges like search engines. MapReduce: The overall idea and simple algorithms. Spark: MapReduce in practice. Algorithms Using Map-Reduce Relational-Algebra Operations, Matrix multiplication. 53 / 54

128 Bibliography A. Rajaraman and J. D. Ullman. Mining of Massive Datasets. Cambridge University Press, J.Lin and Ch. Dyer. Data-Intensive Text Processing with MapReduce. Morgan and Claypool Publishers, Ch. Lam. Hadoop in Action. Manning Publications Co., / 54

MapReduce in Spark. Krzysztof Dembczyński. Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland

MapReduce in Spark. Krzysztof Dembczyński. Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland MapReduce in Spark Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Software Development Technologies Master studies, second semester