MapReduce in Spark. Krzysztof Dembczyński. Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland

Size: px
Start display at page:

Download "MapReduce in Spark. Krzysztof Dembczyński. Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland"

Transcription

1 MapReduce in Spark Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Software Development Technologies Master studies, first semester Academic year 2017/18 (winter course) 1 / 54

2 Review of the previous lectures Mining of massive datasets. Evolution of database systems. Dimensional modeling. ETL and OLAP systems: Extraction, transformation, load. ROLAP, MOLAP, HOLAP. Challenges in OLAP systems: a huge number of possible aggregations to compute. 2 / 54

3 Outline 1 Motivation 2 MapReduce 3 Spark 4 Algorithms in Map-Reduce 5 Summary 3 / 54

4 Outline 1 Motivation 2 MapReduce 3 Spark 4 Algorithms in Map-Reduce 5 Summary 4 / 54

5 Motivation Traditional DBMS vs. NoSQL 5 / 54

6 Motivation Traditional DBMS vs. NoSQL New emerging applications: search engines, social networks, online shopping, online advertising, recommender systems, etc. 5 / 54

7 Motivation Traditional DBMS vs. NoSQL New emerging applications: search engines, social networks, online shopping, online advertising, recommender systems, etc. New computational challenges: WordCount, PageRank, etc. 5 / 54

8 Motivation Traditional DBMS vs. NoSQL New emerging applications: search engines, social networks, online shopping, online advertising, recommender systems, etc. New computational challenges: WordCount, PageRank, etc. New ideas: 5 / 54

9 Motivation Traditional DBMS vs. NoSQL New emerging applications: search engines, social networks, online shopping, online advertising, recommender systems, etc. New computational challenges: WordCount, PageRank, etc. New ideas: Scaling-out instead of scaling-up 5 / 54

10 Motivation Traditional DBMS vs. NoSQL New emerging applications: search engines, social networks, online shopping, online advertising, recommender systems, etc. New computational challenges: WordCount, PageRank, etc. New ideas: Scaling-out instead of scaling-up Move-code-to-data 5 / 54

11 Outline 1 Motivation 2 MapReduce 3 Spark 4 Algorithms in Map-Reduce 5 Summary 6 / 54

12 MapReduce-based systems Accessible run on large clusters of commodity machines or on cloud computing services such as Amazon s Elastic Compute Cloud (EC2). 7 / 54

13 MapReduce-based systems Accessible run on large clusters of commodity machines or on cloud computing services such as Amazon s Elastic Compute Cloud (EC2). Robust are intended to run on commodity hardware; desinged with the assumption of frequent hardware malfunctions; they can gracefully handle most such failures. 7 / 54

14 MapReduce-based systems Accessible run on large clusters of commodity machines or on cloud computing services such as Amazon s Elastic Compute Cloud (EC2). Robust are intended to run on commodity hardware; desinged with the assumption of frequent hardware malfunctions; they can gracefully handle most such failures. Scalable scales linearly to handle larger data by adding more nodes to the cluster. 7 / 54

15 MapReduce-based systems Accessible run on large clusters of commodity machines or on cloud computing services such as Amazon s Elastic Compute Cloud (EC2). Robust are intended to run on commodity hardware; desinged with the assumption of frequent hardware malfunctions; they can gracefully handle most such failures. Scalable scales linearly to handle larger data by adding more nodes to the cluster. Simple allow users to quickly write efficient parallel code. 7 / 54

16 MapReduce: Two simple procedures Word count: A basic operation for every search engine. Matrix-vector multiplication: A fundamental step in many algorithms, for example, in PageRank. 8 / 54

17 MapReduce: Two simple procedures Word count: A basic operation for every search engine. Matrix-vector multiplication: A fundamental step in many algorithms, for example, in PageRank. How to implement these procedures for efficient execution in a distributed system? 8 / 54

18 MapReduce: Two simple procedures Word count: A basic operation for every search engine. Matrix-vector multiplication: A fundamental step in many algorithms, for example, in PageRank. How to implement these procedures for efficient execution in a distributed system? How much can we gain by such implementation? 8 / 54

19 MapReduce: Two simple procedures Word count: A basic operation for every search engine. Matrix-vector multiplication: A fundamental step in many algorithms, for example, in PageRank. How to implement these procedures for efficient execution in a distributed system? How much can we gain by such implementation? Let us focus on the word count problem... 8 / 54

20 Word count Count the number of times each word occurs in a set of documents: Do as I say, not as I do. Word Count as 2 do 2 i 2 not 1 say 1 9 / 54

21 Word count Let us write the procedure in pseudo-code for a single machine: 10 / 54

22 Word count Let us write the procedure in pseudo-code for a single machine: d e f i n e wordcount as M u l t i s e t ; f o r each document i n documentset { T = t o k e n i z e ( document ) ; } f o r each token i n T { wordcount [ token ]++; } d i s p l a y ( wordcount ) ; 10 / 54

23 Word count Let us write the procedure in pseudo-code for many machines: 11 / 54

24 Word count Let us write the procedure in pseudo-code for many machines: First step: d e f i n e wordcount as M u l t i s e t ; f o r each document i n documentsubset { T = t o k e n i z e ( document ) ; f o r each token i n T { wordcount [ token ]++; } } sendtosecondphase ( wordcount ) ; 11 / 54

25 Word count Let us write the procedure in pseudo-code for many machines: First step: d e f i n e wordcount as M u l t i s e t ; f o r each document i n documentsubset { T = t o k e n i z e ( document ) ; f o r each token i n T { wordcount [ token ]++; } } sendtosecondphase ( wordcount ) ; Second step: d e f i n e totalwordcount as M u l t i s e t ; f o r each wordcount r e c e i v e d from f i r s t P h a s e { m u l t i s e t A d d ( totalwordcount, wordcount ) ; } 11 / 54

26 Word count To make the procedure work properly across a cluster of distributed machines, we need to add a number of functionalities: Store files over many processing machines (of phase one). Write a disk-based hash table permitting processing without being limited by RAM capacity. Partition the intermediate data (that is, wordcount) from phase one. Shuffle the partitions to the appropriate machines in phase two. Ensure fault tolerance. 12 / 54

27 MapReduce MapReduce programs are executed in two main phases, called mapping and reducing: 13 / 54

28 MapReduce MapReduce programs are executed in two main phases, called mapping and reducing: Map: the map function is written to convert input elements to key-value pairs. 13 / 54

29 MapReduce MapReduce programs are executed in two main phases, called mapping and reducing: Map: the map function is written to convert input elements to key-value pairs. Reduce: the reduce function is written to take pairs consisting of a key and its list of associated values and combine those values in some way. 13 / 54

30 MapReduce The complete data flow: Input Output map (<k1, v1>) list(<k2, v2>) reduce (<k2, list(<v2>) list(<k3, v3>) 14 / 54

31 MapReduce Figure: The complete data flow 15 / 54

32 MapReduce The complete data flow: 16 / 54

33 MapReduce The complete data flow: The input is structured as a list of key-value pairs: list(<k1,v1>). 16 / 54

34 MapReduce The complete data flow: The input is structured as a list of key-value pairs: list(<k1,v1>). The list of key-value pairs is broken up and each individual key-value pair, <k1,v1>, is processed by calling the map function of the mapper (the key k1 is often ignored by the mapper). 16 / 54

35 MapReduce The complete data flow: The input is structured as a list of key-value pairs: list(<k1,v1>). The list of key-value pairs is broken up and each individual key-value pair, <k1,v1>, is processed by calling the map function of the mapper (the key k1 is often ignored by the mapper). The mapper transforms each <k1,v1> pair into a list of <k2,v2> pairs. 16 / 54

36 MapReduce The complete data flow: The input is structured as a list of key-value pairs: list(<k1,v1>). The list of key-value pairs is broken up and each individual key-value pair, <k1,v1>, is processed by calling the map function of the mapper (the key k1 is often ignored by the mapper). The mapper transforms each <k1,v1> pair into a list of <k2,v2> pairs. The key-value pairs are processed in arbitrary order. 16 / 54

37 MapReduce The complete data flow: The input is structured as a list of key-value pairs: list(<k1,v1>). The list of key-value pairs is broken up and each individual key-value pair, <k1,v1>, is processed by calling the map function of the mapper (the key k1 is often ignored by the mapper). The mapper transforms each <k1,v1> pair into a list of <k2,v2> pairs. The key-value pairs are processed in arbitrary order. The output of all the mappers are (conceptually) aggregated into one giant list of <k2,v2> pairs. All pairs sharing the same k2 are grouped together into a new aggregated key-value pair: <k2,list(v2)>. 16 / 54

38 MapReduce The complete data flow: The input is structured as a list of key-value pairs: list(<k1,v1>). The list of key-value pairs is broken up and each individual key-value pair, <k1,v1>, is processed by calling the map function of the mapper (the key k1 is often ignored by the mapper). The mapper transforms each <k1,v1> pair into a list of <k2,v2> pairs. The key-value pairs are processed in arbitrary order. The output of all the mappers are (conceptually) aggregated into one giant list of <k2,v2> pairs. All pairs sharing the same k2 are grouped together into a new aggregated key-value pair: <k2,list(v2)>. The framework asks the reducer to process each one of these aggregated key-value pairs individually. 16 / 54

39 WordCount in MapReduce Map: For a pair <k1,document> produce a sequence of pairs <token,1>, where token is a token/word found in the document. map( S t r i n g f i l e n a m e, S t r i n g document ) { L i s t <S t r i n g > T = t o k e n i z e ( document ) ; } f o r each token i n T { emit ( ( S t r i n g ) token, ( I n t e g e r ) 1) ; } 17 / 54

40 WordCount in MapReduce Reduce For a pair <word, list(1, 1,..., 1)> sum up all ones appearing in the list and return <word, sum>, where sum is the sum of ones. r e d u c e ( S t r i n g token, L i s t <I n t e g e r > v a l u e s ) { I n t e g e r sum = 0 ; f o r each v a l u e i n v a l u e s { sum = sum + v a l u e ; } } emit ( ( S t r i n g ) token, ( I n t e g e r ) sum ) ; 18 / 54

41 Combiner and partitioner Beside map and reduce there are two other important elements that can be implemented within the MapReduce framework to control the data flow. 19 / 54

42 Combiner and partitioner Beside map and reduce there are two other important elements that can be implemented within the MapReduce framework to control the data flow. Combiner perform local aggregation (the reduce step) on the map node. 19 / 54

43 Combiner and partitioner Beside map and reduce there are two other important elements that can be implemented within the MapReduce framework to control the data flow. Combiner perform local aggregation (the reduce step) on the map node. Partitioner divide the key space of the map output and assign the key-value pairs to reducers. 19 / 54

44 Outline 1 Motivation 2 MapReduce 3 Spark 4 Algorithms in Map-Reduce 5 Summary 20 / 54

45 Spark Spark is a fast and general-purpose cluster computing system. 21 / 54

46 Spark Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. 21 / 54

47 Spark Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including: 21 / 54

48 Spark Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including: Spark SQL for SQL and structured data processing, 21 / 54

49 Spark Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including: Spark SQL for SQL and structured data processing, MLlib for machine learning, 21 / 54

50 Spark Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including: Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, 21 / 54

51 Spark Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including: Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. 21 / 54

52 Spark Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including: Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. For more check 21 / 54

53 Spark Spark uses Hadoop which is a popular open-source implementation of MapReduce. 22 / 54

54 Spark Spark uses Hadoop which is a popular open-source implementation of MapReduce. Hadoop works in a master/slave architecture for both distributed storage and distributed computation. 22 / 54

55 Spark Spark uses Hadoop which is a popular open-source implementation of MapReduce. Hadoop works in a master/slave architecture for both distributed storage and distributed computation. Hadoop Distributed File System (HDFS) is responsible for distributed storage. 22 / 54

56 Installation of Spark Download Spark from 23 / 54

57 Installation of Spark Download Spark from 23 / 54

58 Installation of Spark Download Spark from Untar the spark archive: 23 / 54

59 Installation of Spark Download Spark from Untar the spark archive: tar xvfz spark bin-hadoop2.7.tar 23 / 54

60 Installation of Spark Download Spark from Untar the spark archive: tar xvfz spark bin-hadoop2.7.tar To play with Spark there is no need to install HDFS / 54

61 Installation of Spark Download Spark from Untar the spark archive: tar xvfz spark bin-hadoop2.7.tar To play with Spark there is no need to install HDFS... But, you can try to play around with HDFS. 23 / 54

62 HDFS Create new directories: 24 / 54

63 HDFS Create new directories: hdfs dfs -mkdir /user 24 / 54

64 HDFS Create new directories: hdfs dfs -mkdir /user hdfs dfs -mkdir /user/mynane 24 / 54

65 HDFS Create new directories: hdfs dfs -mkdir /user hdfs dfs -mkdir /user/mynane Copy the input files into the distributed filesystem: 24 / 54

66 HDFS Create new directories: hdfs dfs -mkdir /user hdfs dfs -mkdir /user/mynane Copy the input files into the distributed filesystem: hdfs dfs -put data.txt /user/myname/data.txt 24 / 54

67 HDFS Create new directories: hdfs dfs -mkdir /user hdfs dfs -mkdir /user/mynane Copy the input files into the distributed filesystem: hdfs dfs -put data.txt /user/myname/data.txt View the files in the distributed filesystem: 24 / 54

68 HDFS Create new directories: hdfs dfs -mkdir /user hdfs dfs -mkdir /user/mynane Copy the input files into the distributed filesystem: hdfs dfs -put data.txt /user/myname/data.txt View the files in the distributed filesystem: hdfs dfs -ls /user/myname/ 24 / 54

69 HDFS Create new directories: hdfs dfs -mkdir /user hdfs dfs -mkdir /user/mynane Copy the input files into the distributed filesystem: hdfs dfs -put data.txt /user/myname/data.txt View the files in the distributed filesystem: hdfs dfs -ls /user/myname/ hdfs dfs -cat /user/myname/data.txt 24 / 54

70 WordCount in Hadoop import ja va. i o. IOException ; import j a v a. u t i l. S t r i n g T o k e n i z e r ; import org. apache. hadoop. c o n f. C o n f i g u r a t i o n ; import org. apache. hadoop. f s. Path ; import org. apache. hadoop. i o. I n t W r i t a b l e ; import org. apache. hadoop. i o. Text ; import org. apache. hadoop. mapreduce. Job ; import org. apache. hadoop. mapreduce. Mapper ; import org. apache. hadoop. mapreduce. Reducer ; import org. apache. hadoop. mapreduce. l i b. i n p u t. F i l e I n p u t F o r m a t ; import org. apache. hadoop. mapreduce. l i b. o u tput. F i l e O u t p u t F o r m a t ; p u b l i c c l a s s WordCount { p u b l i c s t a t i c c l a s s TokenizerMapper extends Mapper<Object, Text, Text, IntWritable >{ p r i v a t e f i n a l s t a t i c I n t W r i t a b l e one = new I n t W r i t a b l e ( 1 ) ; p r i v a t e Text word = new Text ( ) ; p u b l i c v o i d map( Object key, Text value, Context context ) throws IOException, InterruptedException { S t r i n g T o k e n i z e r i t r = new S t r i n g T o k e n i z e r ( v a l u e. t o S t r i n g ( ) ) ; w h i l e ( i t r. hasmoretokens ( ) ) { word. s e t ( i t r. nexttoken ( ) ) ; c o n t e x t. w r i t e ( word, one ) ; } } } (... ) 25 / 54

71 WordCount in Hadoop (... ) p u b l i c s t a t i c c l a s s IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> { p r i v a t e IntWritable r e s u l t = new IntWritable ( ) ; p u b l i c v o i d r e d u c e ( Text key, I t e r a b l e <I n t W r i t a b l e > v a l u e s, Context c o n t e x t ) throws IOException, InterruptedException { i n t sum = 0 ; f o r ( I n t W r i t a b l e v a l : v a l u e s ) { sum += v a l. g e t ( ) ; } r e s u l t. s e t ( sum ) ; c o n t e x t. w r i t e ( key, r e s u l t ) ; } } public s t a t i c void main ( String [ ] args ) throws Exception { C o n f i g u r a t i o n conf = new C o n f i g u r a t i o n ( ) ; Job job = Job. getinstance ( conf, word count ) ; j o b. s e t J a r B y C l a s s ( WordCount. c l a s s ) ; j o b. s e t M a p p e r C l a s s ( TokenizerMapper. c l a s s ) ; j o b. s e t C o m b i n e r C l a s s ( IntSumReducer. c l a s s ) ; j o b. s e t R e d u c e r C l a s s ( IntSumReducer. c l a s s ) ; j o b. s e t O u t p u t K e y C l a s s ( Text. c l a s s ) ; j o b. s e t O u t p u t V a l u e C l a s s ( I n t W r i t a b l e. c l a s s ) ; F i l e I n p u t F o r m a t. addinputpath ( job, new Path ( a r g s [ 0 ] ) ) ; F i l e O u t p u t F o r m a t. setoutputpath ( job, new Path ( a r g s [ 1 ] ) ) ; System. e x i t ( j o b. w a i t F o r C o m p l e t i o n ( t r u e )? 0 : 1) ; } } 26 / 54

72 WordCount in Spark The same code is much simpler in Spark To run the Spark shell type:./bin/spark-shell The code v a l t e x t F i l e = s c. t e x t F i l e ( / data / a l l b i b l e. t x t ) v a l c o u n t s = ( t e x t F i l e. f l atmap ( l i n e => l i n e. s p l i t ( ) ). map( word => ( word, 1) ). reducebykey ( + ) ) c o u n t s. s a v e A s T e x t F i l e ( / data / a l l b i b l e c o u n t s. t x t ) Alternatively: v a l wordcounts = t e x t F i l e. f latmap ( l i n e => l i n e. s p l i t ( ) ). groupbykey ( i d e n t i t y ). count ( ) 27 / 54

73 Outline 1 Motivation 2 MapReduce 3 Spark 4 Algorithms in Map-Reduce 5 Summary 28 / 54

74 Algorithms in Map-Reduce How to implement fundamental algorithms in MapReduce? Matrix-vector multiplication. Relational-Algebra Operations. Matrix multiplication. 29 / 54

75 Matrix-vector Multiplication Let A to be large n m matrix, and x a long vector of size m. The matrix-vector multiplication is defined as: 30 / 54

76 Matrix-vector Multiplication Let A to be large n m matrix, and x a long vector of size m. The matrix-vector multiplication is defined as: Ax = v, where v = (v 1,..., v n ) and m v i = a ij x j. j=1 30 / 54

77 Matrix-vector multiplication Let us first assume that m is large, but not so large that vector x cannot fit in main memory, and be part of the input to every Map task. The matrix A is stored with explicit coordinates, as a triple (i, j, a ij ). We also assume the position of element x j in the vector x will be stored in the analogous way. 31 / 54

78 Matrix-vector multiplication Map: 32 / 54

79 Matrix-vector multiplication Map: each map task will take the entire vector x and a chunk of the matrix A. From each matrix element a ij it produces the key-value pair (i, a ij x j ). Thus, all terms of the sum that make up the component v i of the matrix-vector product will get the same key. 32 / 54

80 Matrix-vector multiplication Map: each map task will take the entire vector x and a chunk of the matrix A. From each matrix element a ij it produces the key-value pair (i, a ij x j ). Thus, all terms of the sum that make up the component v i of the matrix-vector product will get the same key. Reduce: 32 / 54

81 Matrix-vector multiplication Map: each map task will take the entire vector x and a chunk of the matrix A. From each matrix element a ij it produces the key-value pair (i, a ij x j ). Thus, all terms of the sum that make up the component v i of the matrix-vector product will get the same key. Reduce: a reduce task has simply to sum all the values associated with a given key i. The result will be a pair (i, v i ) where: v i = m a ij x j. j=1 32 / 54

82 Matrix-vector multiplication in Spark The Spark code is quite simple: v a l x = s c. t e x t F i l e ( / data / x. t x t ). map( l i n e => { v a l t = l i n e. s p l i t (, ) ; ( t ( 0 ). t r i m. t o I n t, t ( 1 ). t r i m. todouble ) }) v a l v e c t o r X = x. map{case ( i, v ) => v }. c o l l e c t v a l b r o a d c a s t e d X = s c. b r o a d c a s t ( v e c t o r X ) v a l m a t r i x = s c. t e x t F i l e ( / data /M. t x t ). map( l i n e => { v a l t = l i n e. s p l i t (, ) ; ( t ( 0 ). t r i m. t o I n t, t ( 1 ). t r i m. t o I n t, t ( 2 ). t r i m. todouble ) }) v a l v = m a t r i x. map { case ( i, j, a ) => ( i, a b r o a d c a s t e d X. v a l u e ( j 1)) }. reducebykey ( + ) v. todf. orderby ( 1 ). show 33 / 54

83 Matrix-Vector Multiplication with Large Vector v 34 / 54

84 Matrix-Vector Multiplication with Large Vector v Divide the matrix into vertical stripes of equal width and divide the vector into an equal number of horizontal stripes, of the same height. The ith stripe of the matrix multiplies only components from the ith stripe of the vector. Thus, we can divide the matrix into one file for each stripe, and do the same for the vector. 34 / 54

85 Matrix-Vector Multiplication with Large Vector v Each Map task is assigned a chunk from one the stripes of the matrix and gets the entire corresponding stripe of the vector. The Map and Reduce tasks can then act exactly as in the case where Map tasks get the entire vector. 35 / 54

86 Relational-Algebra Operations Example (Relation Links) From To url1 url2 url1 url3 url2 url3 url2 url / 54

87 Relational-Algebra Operations Selection Projection Union, Intersection, and Difference Natural Join Grouping and Aggregation 37 / 54

88 Relational-Algebra Operations R, S - relation t, t - a tuple C - a condition of selection A, B, C - subset of attributes a, b, c - attribute values for a given subset of attributes 38 / 54

89 Selection Map: 39 / 54

90 Selection Map: For each tuple t in R, test if it satisfies C. If so, produce the key-value pair (t, t). That is, both the key and value are t. Reduce: 39 / 54

91 Selection Map: For each tuple t in R, test if it satisfies C. If so, produce the key-value pair (t, t). That is, both the key and value are t. Reduce: The Reduce function is the identity. It simply passes each key-value pair to the output. 39 / 54

92 Projection 40 / 54

93 Projection Map: 40 / 54

94 Projection Map: For each tuple t in R, construct a tuple t by eliminating from t those components whose attributes are not in A. Output the key-value pair (t, t ). Reduce: 40 / 54

95 Projection Map: For each tuple t in R, construct a tuple t by eliminating from t those components whose attributes are not in A. Output the key-value pair (t, t ). Reduce: For each key t produced by any of the Map tasks, there will be one or more key-value pairs (t, t ). The Reduce function turns (t, [t, t,..., t ]) into (t, t ), so it produces exactly one pair (t, t ) for this key t. 40 / 54

96 Union Map: 41 / 54

97 Union Map: Turn each input tuple t either from relation R or S into a key-value pair (t, t). Reduce: 41 / 54

98 Union Map: Turn each input tuple t either from relation R or S into a key-value pair (t, t). Reduce: Associated with each key t there will be either one or two values. Produce output (t, t) in either case. 41 / 54

99 Intersection Map: 42 / 54

100 Intersection Map: Turn each input tuple t either from relation R or S into a key-value pair (t, t). Reduce: 42 / 54

101 Intersection Map: Turn each input tuple t either from relation R or S into a key-value pair (t, t). Reduce: If key t has value list [t, t], then produce (t, t). Otherwise, produce nothing. 42 / 54

102 Minus Map: 43 / 54

103 Minus Map: For a tuple t in R, produce key-value pair (t, name(r)), and for a tuple t in S, produce key-value pair (t, name(s)). Reduce: 43 / 54

104 Minus Map: For a tuple t in R, produce key-value pair (t, name(r)), and for a tuple t in S, produce key-value pair (t, name(s)). Reduce: For each key t, do the following. 1 If the associated value list is [name(r)], then produce (t, t). 2 If the associated value list is anything else, which could only be [name(r), name(s)], [name(s), name(r)], or [name(s)], produce nothing. 43 / 54

105 Natural Join Let us assume that we join relation R(A, B) with relation S(B, C) that share the same attribute B. Map: 44 / 54

106 Natural Join Let us assume that we join relation R(A, B) with relation S(B, C) that share the same attribute B. Map: For each tuple (a, b) of R, produce the key-value pair (b, (name(r), a)). For each tuple (b, c) of S, produce the key-value pair (b, (name(s), c)). Reduce: 44 / 54

107 Natural Join Let us assume that we join relation R(A, B) with relation S(B, C) that share the same attribute B. Map: For each tuple (a, b) of R, produce the key-value pair (b, (name(r), a)). For each tuple (b, c) of S, produce the key-value pair (b, (name(s), c)). Reduce: Each key value b will be associated with a list of pairs that are either of the form (name(r), a) or (name(s), c). Construct all pairs consisting of one with first component name(r) and the other with first component S, say (name(r), a) and (name(s), c). The output for key b is (b, [(a1, b, c1), (a2, b, c2),...]), that is, b associated with the list of tuples that can be formed from an R-tuple and an S-tuple with a common b value. 44 / 54

108 Grouping and Aggregation Let assume that we group a relation R(A, B, C) by attributes A and aggregate values of B. Map: 45 / 54

109 Grouping and Aggregation Let assume that we group a relation R(A, B, C) by attributes A and aggregate values of B. Map: For each tuple (a, b, c) produce the key-value pair (a, b). Reduce: 45 / 54

110 Grouping and Aggregation Let assume that we group a relation R(A, B, C) by attributes A and aggregate values of B. Map: For each tuple (a, b, c) produce the key-value pair (a, b). Reduce: Each key a represents a group. Apply the aggregation operator θ to the list [b 1, b 2,..., b n ] of B-values associated with key a. The output is the pair (a, x), where x is the result of applying θ to the list. For example, if θ is SUM, then x = b 1 + b b n, and if θ is MAX, then x is the largest of b 1, b 2,..., b n. 45 / 54

111 Matrix Multiplication If M is a matrix with element m ij in row i and column j, and N is a matrix with element n jk in row j and column k, then the product: P = MN is the matrix P with element p ik in row i and column k, where: pik = 46 / 54

112 Matrix Multiplication If M is a matrix with element m ij in row i and column j, and N is a matrix with element n jk in row j and column k, then the product: P = MN is the matrix P with element p ik in row i and column k, where: pik = j m ij n jk 46 / 54

113 Matrix Multiplication We can think of a matrix M and N as a relation with three attributes: the row number, the column number, and the value in that row and column, i.e.,: M(I, J, V ) and N(J, K, W ) with the following tuples, respectively: (i, j, m ij ) and (j, k, n jk ). In case of sparsity of M and N, this relational representation is very efficient in terms of space. The product MN is almost a natural join followed by grouping and aggregation. 47 / 54

114 Matrix Multiplication 48 / 54

115 Matrix Multiplication Map: 48 / 54

116 Matrix Multiplication Map: Send each matrix element m ij to the key value pair: (j, (M, i, m ij )). Analogously, send each matrix element n jk to the key value pair: (j, (N, k, n jk )). Reduce: 48 / 54

117 Matrix Multiplication Map: Send each matrix element m ij to the key value pair: (j, (M, i, m ij )). Analogously, send each matrix element n jk to the key value pair: (j, (N, k, n jk )). Reduce: For each key j, examine its list of associated values. For each value that comes from M, say (M, i, m ij ), and each value that comes from N, say (N, k, n jk ), produce the tuple (i, k, v = m ij n jk ), The output of the Reduce function is a key j paired with the list of all the tuples of this form that we get from j: (j, [(i 1, k 1, v 1 ), (i 2, k 2, v 2 ),..., (i p, k p, v p )]). 48 / 54

118 Matrix Multiplication 49 / 54

119 Matrix Multiplication Map: 49 / 54

120 Matrix Multiplication Map: From the pairs that are output from the previous Reduce function produce p key-value pairs: Reduce: ((i 1, k 1 ), v 1 ), ((i 2, k 2 ), v 2 ),..., ((i p, k p ), v p ). 49 / 54

121 Matrix Multiplication Map: From the pairs that are output from the previous Reduce function produce p key-value pairs: ((i 1, k 1 ), v 1 ), ((i 2, k 2 ), v 2 ),..., ((i p, k p ), v p ). Reduce: For each key (i, k), produce the sum of the list of values associated with this key. The result is a pair ((i, k), v), where v is the value of the element in row i and column k of the matrix P = MN. 49 / 54

122 Matrix Multiplication with One Map-Reduce Step Map: 50 / 54

123 Matrix Multiplication with One Map-Reduce Step Map: For each element m ij of M, produce a key-value pair ((i, k), (M, j, m ij )), for k = 1, 2,..., up to the number of columns of N. Also, for each element n jk of N, produce a key-value pair ((i, k), (N, j, n jk )), for i = 1, 2,..., up to the number of rows of M. 50 / 54

124 Matrix Multiplication with One Map-Reduce Step Reduce: 51 / 54

125 Matrix Multiplication with One Map-Reduce Step Reduce: Each key (i, k) will have an associated list with all the values (M, j, m ij ) and (N, j, n jk ), for all possible values of j. We connect the two values on the list that have the same value of j, for each j: We sort by j the values that begin with M and sort by j the values that begin with N, in separate lists, The jth values on each list must have their third components, mij and n jk extracted and multiplied, Then, these products are summed and the result is paired with (i, k) in the output of the Reduce function. 51 / 54

126 Outline 1 Motivation 2 MapReduce 3 Spark 4 Algorithms in Map-Reduce 5 Summary 52 / 54

127 Summary Computational burden data partitioning, distributed systems. New data-intensive challenges like search engines. MapReduce: The overall idea and simple algorithms. Spark: MapReduce in practice. Algorithms Using Map-Reduce Relational-Algebra Operations, Matrix multiplication. 53 / 54

128 Bibliography A. Rajaraman and J. D. Ullman. Mining of Massive Datasets. Cambridge University Press, J.Lin and Ch. Dyer. Data-Intensive Text Processing with MapReduce. Morgan and Claypool Publishers, Ch. Lam. Hadoop in Action. Manning Publications Co., / 54

MapReduce in Spark. Krzysztof Dembczyński. Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland

MapReduce in Spark. Krzysztof Dembczyński. Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland MapReduce in Spark Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Software Development Technologies Master studies, second semester

More information

CS425: Algorithms for Web Scale Data

CS425: Algorithms for Web Scale Data CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at: www.mmds.org Challenges

More information

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 Star Joins A common structure for data mining of commercial data is the star join. For example, a chain store like Walmart keeps a fact table whose tuples each

More information

Big Data Analytics. Lucas Rego Drumond

Big Data Analytics. Lucas Rego Drumond Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Map Reduce I Map Reduce I 1 / 32 Outline 1. Introduction 2. Parallel

More information

CS 347. Parallel and Distributed Data Processing. Spring Notes 11: MapReduce

CS 347. Parallel and Distributed Data Processing. Spring Notes 11: MapReduce CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 11: MapReduce Motivation Distribution makes simple computations complex Communication Load balancing Fault tolerance Not all applications

More information

V.4 MapReduce. 1. System Architecture 2. Programming Model 3. Hadoop. Based on MRS Chapter 4 and RU Chapter 2 IR&DM 13/ 14 !74

V.4 MapReduce. 1. System Architecture 2. Programming Model 3. Hadoop. Based on MRS Chapter 4 and RU Chapter 2 IR&DM 13/ 14 !74 V.4 MapReduce. System Architecture 2. Programming Model 3. Hadoop Based on MRS Chapter 4 and RU Chapter 2!74 Why MapReduce? Large clusters of commodity computers (as opposed to few supercomputers) Challenges:

More information

ArcGIS GeoAnalytics Server: An Introduction. Sarah Ambrose and Ravi Narayanan

ArcGIS GeoAnalytics Server: An Introduction. Sarah Ambrose and Ravi Narayanan ArcGIS GeoAnalytics Server: An Introduction Sarah Ambrose and Ravi Narayanan Overview Introduction Demos Analysis Concepts using GeoAnalytics Server GeoAnalytics Data Sources GeoAnalytics Server Administration

More information

Knowledge Discovery and Data Mining 1 (VO) ( )

Knowledge Discovery and Data Mining 1 (VO) ( ) Knowledge Discovery and Data Mining 1 (VO) (707.003) Map-Reduce Denis Helic KTI, TU Graz Oct 24, 2013 Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 1 / 82 Big picture: KDDM Probability Theory Linear Algebra

More information

Project 2: Hadoop PageRank Cloud Computing Spring 2017

Project 2: Hadoop PageRank Cloud Computing Spring 2017 Project 2: Hadoop PageRank Cloud Computing Spring 2017 Professor Judy Qiu Goal This assignment provides an illustration of PageRank algorithms and Hadoop. You will then blend these applications by implementing

More information

Distributed Architectures

Distributed Architectures Distributed Architectures Software Architecture VO/KU (707023/707024) Roman Kern KTI, TU Graz 2015-01-21 Roman Kern (KTI, TU Graz) Distributed Architectures 2015-01-21 1 / 64 Outline 1 Introduction 2 Independent

More information

Computational Frameworks. MapReduce

Computational Frameworks. MapReduce Computational Frameworks MapReduce 1 Computational challenges in data mining Computation-intensive algorithms: e.g., optimization algorithms, graph algorithms Large inputs: e.g., web, social networks data.

More information

Rainfall data analysis and storm prediction system

Rainfall data analysis and storm prediction system Rainfall data analysis and storm prediction system SHABARIRAM, M. E. Available from Sheffield Hallam University Research Archive (SHURA) at: http://shura.shu.ac.uk/15778/ This document is the author deposited

More information

Computational Frameworks. MapReduce

Computational Frameworks. MapReduce Computational Frameworks MapReduce 1 Computational complexity: Big data challenges Any processing requiring a superlinear number of operations may easily turn out unfeasible. If input size is really huge,

More information

Query Analyzer for Apache Pig

Query Analyzer for Apache Pig Imperial College London Department of Computing Individual Project: Final Report Query Analyzer for Apache Pig Author: Robert Yau Zhou 00734205 (robert.zhou12@imperial.ac.uk) Supervisor: Dr Peter McBrien

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 12: Real-Time Data Analytics (2/2) March 31, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

Project 3: Hadoop Blast Cloud Computing Spring 2017

Project 3: Hadoop Blast Cloud Computing Spring 2017 Project 3: Hadoop Blast Cloud Computing Spring 2017 Professor Judy Qiu Goal By this point you should have gone over the sections concerning Hadoop Setup and a few Hadoop programs. Now you are going to

More information

Spatial Analytics Workshop

Spatial Analytics Workshop Spatial Analytics Workshop Pete Skomoroch, LinkedIn (@peteskomoroch) Kevin Weil, Twitter (@kevinweil) Sean Gorman, FortiusOne (@seangorman) #spatialanalytics Introduction The Rise of Spatial Analytics

More information

ArcGIS Enterprise: What s New. Philip Heede Shannon Kalisky Melanie Summers Sam Williamson

ArcGIS Enterprise: What s New. Philip Heede Shannon Kalisky Melanie Summers Sam Williamson ArcGIS Enterprise: What s New Philip Heede Shannon Kalisky Melanie Summers Sam Williamson ArcGIS Enterprise is the new name for ArcGIS for Server What is ArcGIS Enterprise ArcGIS Enterprise is powerful

More information

Visualizing Big Data on Maps: Emerging Tools and Techniques. Ilir Bejleri, Sanjay Ranka

Visualizing Big Data on Maps: Emerging Tools and Techniques. Ilir Bejleri, Sanjay Ranka Visualizing Big Data on Maps: Emerging Tools and Techniques Ilir Bejleri, Sanjay Ranka Topics Web GIS Visualization Big Data GIS Performance Maps in Data Visualization Platforms Next: Web GIS Visualization

More information

CS246: Mining Massive Data Sets Winter Only one late period is allowed for this homework (11:59pm 2/14). General Instructions

CS246: Mining Massive Data Sets Winter Only one late period is allowed for this homework (11:59pm 2/14). General Instructions CS246: Mining Massive Data Sets Winter 2017 Problem Set 2 Due 11:59pm February 9, 2017 Only one late period is allowed for this homework (11:59pm 2/14). General Instructions Submission instructions: These

More information

Data-Intensive Distributed Computing

Data-Intensive Distributed Computing Data-Intensive Distributed Computing CS 451/651 431/631 (Winter 2018) Part 9: Real-Time Data Analytics (2/2) March 29, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

Research Collection. JSONiq on Spark. Master Thesis. ETH Library. Author(s): Irimescu, Stefan. Publication Date: 2018

Research Collection. JSONiq on Spark. Master Thesis. ETH Library. Author(s): Irimescu, Stefan. Publication Date: 2018 Research Collection Master Thesis JSONiq on Spark Author(s): Irimescu, Stefan Publication Date: 2018 Permanent Link: https://doi.org/10.3929/ethz-b-000272416 Rights / License: In Copyright - Non-Commercial

More information

Multi-join Query Evaluation on Big Data Lecture 2

Multi-join Query Evaluation on Big Data Lecture 2 Multi-join Query Evaluation on Big Data Lecture 2 Dan Suciu March, 2015 Dan Suciu Multi-Joins Lecture 2 March, 2015 1 / 34 Multi-join Query Evaluation Outline Part 1 Optimal Sequential Algorithms. Thursday

More information

STATISTICAL PERFORMANCE

STATISTICAL PERFORMANCE STATISTICAL PERFORMANCE PROVISIONING AND ENERGY EFFICIENCY IN DISTRIBUTED COMPUTING SYSTEMS Nikzad Babaii Rizvandi 1 Supervisors: Prof.Albert Y.Zomaya Prof. Aruna Seneviratne OUTLINE Introduction Background

More information

Scalable Asynchronous Gradient Descent Optimization for Out-of-Core Models

Scalable Asynchronous Gradient Descent Optimization for Out-of-Core Models Scalable Asynchronous Gradient Descent Optimization for Out-of-Core Models Chengjie Qin 1, Martin Torres 2, and Florin Rusu 2 1 GraphSQL, Inc. 2 University of California Merced August 31, 2017 Machine

More information

ArcGIS Enterprise: What s New. Philip Heede Shannon Kalisky Melanie Summers Shreyas Shinde

ArcGIS Enterprise: What s New. Philip Heede Shannon Kalisky Melanie Summers Shreyas Shinde ArcGIS Enterprise: What s New Philip Heede Shannon Kalisky Melanie Summers Shreyas Shinde ArcGIS Enterprise is the new name for ArcGIS for Server ArcGIS Enterprise Software Components ArcGIS Server Portal

More information

Lecture 5: Web Searching using the SVD

Lecture 5: Web Searching using the SVD Lecture 5: Web Searching using the SVD Information Retrieval Over the last 2 years the number of internet users has grown exponentially with time; see Figure. Trying to extract information from this exponentially

More information

CONTEMPORARY ANALYTICAL ECOSYSTEM PATRICK HALL, SAS INSTITUTE

CONTEMPORARY ANALYTICAL ECOSYSTEM PATRICK HALL, SAS INSTITUTE CONTEMPORARY ANALYTICAL ECOSYSTEM PATRICK HALL, SAS INSTITUTE Copyright 2013, SAS Institute Inc. All rights reserved. Agenda (Optional) History Lesson 2015 Buzzwords Machine Learning for X Citizen Data

More information

13 Searching the Web with the SVD

13 Searching the Web with the SVD 13 Searching the Web with the SVD 13.1 Information retrieval Over the last 20 years the number of internet users has grown exponentially with time; see Figure 1. Trying to extract information from this

More information

Using R for Iterative and Incremental Processing

Using R for Iterative and Incremental Processing Using R for Iterative and Incremental Processing Shivaram Venkataraman, Indrajit Roy, Alvin AuYoung, Robert Schreiber UC Berkeley and HP Labs UC BERKELEY Big Data, Complex Algorithms PageRank (Dominant

More information

Introduction to ArcGIS GeoAnalytics Server. Sarah Ambrose & Noah Slocum

Introduction to ArcGIS GeoAnalytics Server. Sarah Ambrose & Noah Slocum Introduction to ArcGIS GeoAnalytics Server Sarah Ambrose & Noah Slocum Agenda Overview Analysis Capabilities + Demo Deployment and Configuration Questions ArcGIS GeoAnalytics Server uses the power of distributed

More information

CS246 Final Exam. March 16, :30AM - 11:30AM

CS246 Final Exam. March 16, :30AM - 11:30AM CS246 Final Exam March 16, 2016 8:30AM - 11:30AM Name : SUID : I acknowledge and accept the Stanford Honor Code. I have neither given nor received unpermitted help on this examination. (signed) Directions

More information

CS246 Final Exam, Winter 2011

CS246 Final Exam, Winter 2011 CS246 Final Exam, Winter 2011 1. Your name and student ID. Name:... Student ID:... 2. I agree to comply with Stanford Honor Code. Signature:... 3. There should be 17 numbered pages in this exam (including

More information

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 3: Query Processing Query Processing Decomposition Localization Optimization CS 347 Notes 3 2 Decomposition Same as in centralized system

More information

Progressive & Algorithms & Systems

Progressive & Algorithms & Systems University of California Merced Lawrence Berkeley National Laboratory Progressive Computation for Data Exploration Progressive Computation Online Aggregation (OLA) in DB Query Result Estimate Result ε

More information

9 Searching the Internet with the SVD

9 Searching the Internet with the SVD 9 Searching the Internet with the SVD 9.1 Information retrieval Over the last 20 years the number of internet users has grown exponentially with time; see Figure 1. Trying to extract information from this

More information

Jug: Executing Parallel Tasks in Python

Jug: Executing Parallel Tasks in Python Jug: Executing Parallel Tasks in Python Luis Pedro Coelho EMBL 21 May 2013 Luis Pedro Coelho (EMBL) Jug 21 May 2013 (1 / 24) Jug: Coarse Parallel Tasks in Python Parallel Python code Memoization Luis Pedro

More information

D2D SALES WITH SURVEY123, OP DASHBOARD, AND MICROSOFT SSAS

D2D SALES WITH SURVEY123, OP DASHBOARD, AND MICROSOFT SSAS D2D SALES WITH SURVEY123, OP DASHBOARD, AND MICROSOFT SSAS EDWARD GAUSE, GISP DIRECTOR OF INFORMATION SERVICES (ENGINEERING APPS) HTC (HORRY TELEPHONE COOP.) EDWARD GAUSE, GISP DIRECTOR OF INFORMATION

More information

Finding similar items

Finding similar items Finding similar items CSE 344, section 10 June 2, 2011 In this section, we ll go through some examples of finding similar item sets. We ll directly compare all pairs of sets being considered using the

More information

Relational Algebra on Bags. Why Bags? Operations on Bags. Example: Bag Selection. σ A+B < 5 (R) = A B

Relational Algebra on Bags. Why Bags? Operations on Bags. Example: Bag Selection. σ A+B < 5 (R) = A B Relational Algebra on Bags Why Bags? 13 14 A bag (or multiset ) is like a set, but an element may appear more than once. Example: {1,2,1,3} is a bag. Example: {1,2,3} is also a bag that happens to be a

More information

How to Optimally Allocate Resources for Coded Distributed Computing?

How to Optimally Allocate Resources for Coded Distributed Computing? 1 How to Optimally Allocate Resources for Coded Distributed Computing? Qian Yu, Songze Li, Mohammad Ali Maddah-Ali, and A. Salman Avestimehr Department of Electrical Engineering, University of Southern

More information

The Challenge of Geospatial Big Data Analysis

The Challenge of Geospatial Big Data Analysis 288 POSTERS The Challenge of Geospatial Big Data Analysis Authors - Teerayut Horanont, University of Tokyo, Japan - Apichon Witayangkurn, University of Tokyo, Japan - Shibasaki Ryosuke, University of Tokyo,

More information

EFFICIENT SUMMARIZATION TECHNIQUES FOR MASSIVE DATA

EFFICIENT SUMMARIZATION TECHNIQUES FOR MASSIVE DATA EFFICIENT SUMMARIZATION TECHNIQUES FOR MASSIVE DATA by Jeffrey Jestes A thesis submitted to the faculty of The University of Utah in partial fulfillment of the requirements for the degree of Doctor of

More information

Data Intensive Computing Handout 11 Spark: Transitive Closure

Data Intensive Computing Handout 11 Spark: Transitive Closure Data Intensive Computing Handout 11 Spark: Transitive Closure from random import Random spark/transitive_closure.py num edges = 3000 num vertices = 500 rand = Random(42) def generate g raph ( ) : edges

More information

Harvard Center for Geographic Analysis Geospatial on the MOC

Harvard Center for Geographic Analysis Geospatial on the MOC 2017 Massachusetts Open Cloud Workshop Boston University Harvard Center for Geographic Analysis Geospatial on the MOC Ben Lewis Harvard Center for Geographic Analysis Aaron Williams MapD Small Team Supporting

More information

CS 347 Distributed Databases and Transaction Processing Notes03: Query Processing

CS 347 Distributed Databases and Transaction Processing Notes03: Query Processing CS 347 Distributed Databases and Transaction Processing Notes03: Query Processing Hector Garcia-Molina Zoltan Gyongyi CS 347 Notes 03 1 Query Processing! Decomposition! Localization! Optimization CS 347

More information

Assignment 1 Physics/ECE 176

Assignment 1 Physics/ECE 176 Assignment 1 Physics/ECE 176 Made available: Thursday, January 13, 211 Due: Thursday, January 2, 211, by the beginning of class. Overview Before beginning this assignment, please read carefully the part

More information

Linear Algebra March 16, 2019

Linear Algebra March 16, 2019 Linear Algebra March 16, 2019 2 Contents 0.1 Notation................................ 4 1 Systems of linear equations, and matrices 5 1.1 Systems of linear equations..................... 5 1.2 Augmented

More information

arxiv: v3 [stat.ml] 14 Apr 2016

arxiv: v3 [stat.ml] 14 Apr 2016 arxiv:1307.0048v3 [stat.ml] 14 Apr 2016 Simple one-pass algorithm for penalized linear regression with cross-validation on MapReduce Kun Yang April 15, 2016 Abstract In this paper, we propose a one-pass

More information

CS224W: Methods of Parallelized Kronecker Graph Generation

CS224W: Methods of Parallelized Kronecker Graph Generation CS224W: Methods of Parallelized Kronecker Graph Generation Sean Choi, Group 35 December 10th, 2012 1 Introduction The question of generating realistic graphs has always been a topic of huge interests.

More information

YYT-C3002 Application Programming in Engineering GIS I. Anas Altartouri Otaniemi

YYT-C3002 Application Programming in Engineering GIS I. Anas Altartouri Otaniemi YYT-C3002 Application Programming in Engineering GIS I Otaniemi Overview: GIS lectures & exercise We will deal with GIS application development in two lectures. Because of the versatility of GIS data models

More information

Slides based on those in:

Slides based on those in: Spyros Kontogiannis & Christos Zaroliagis Slides based on those in: http://www.mmds.org High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering

More information

Dot-Product Join: Scalable In-Database Linear Algebra for Big Model Analytics

Dot-Product Join: Scalable In-Database Linear Algebra for Big Model Analytics Dot-Product Join: Scalable In-Database Linear Algebra for Big Model Analytics Chengjie Qin 1 and Florin Rusu 2 1 GraphSQL, Inc. 2 University of California Merced June 29, 2017 Machine Learning (ML) Is

More information

Machine Learning for Gravitational Wave signals classification in LIGO and Virgo

Machine Learning for Gravitational Wave signals classification in LIGO and Virgo Machine Learning for Gravitational Wave signals classification in LIGO and Virgo Elena Cuoco European Gravitational Observatory www.elenacuoco.com @elenacuoco 2 About me About me Working as Data Analyst

More information

WDCloud: An End to End System for Large- Scale Watershed Delineation on Cloud

WDCloud: An End to End System for Large- Scale Watershed Delineation on Cloud WDCloud: An End to End System for Large- Scale Watershed Delineation on Cloud * In Kee Kim, * Jacob Steele, + Anthony Castronova, * Jonathan Goodall, and * Marty Humphrey * University of Virginia + Utah

More information

CS5112: Algorithms and Data Structures for Applications

CS5112: Algorithms and Data Structures for Applications CS5112: Algorithms and Data Structures for Applications Lecture 19: Association rules Ramin Zabih Some content from: Wikipedia/Google image search; Harrington; J. Leskovec, A. Rajaraman, J. Ullman: Mining

More information

CS Data Structures and Algorithm Analysis

CS Data Structures and Algorithm Analysis CS 483 - Data Structures and Algorithm Analysis Lecture II: Chapter 2 R. Paul Wiegand George Mason University, Department of Computer Science February 1, 2006 Outline 1 Analysis Framework 2 Asymptotic

More information

ArcGIS is Advancing. Both Contributing and Integrating many new Innovations. IoT. Smart Mapping. Smart Devices Advanced Analytics

ArcGIS is Advancing. Both Contributing and Integrating many new Innovations. IoT. Smart Mapping. Smart Devices Advanced Analytics ArcGIS is Advancing IoT Smart Devices Advanced Analytics Smart Mapping Real-Time Faster Computing Web Services Crowdsourcing Sensor Networks Both Contributing and Integrating many new Innovations ArcGIS

More information

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University Outline Introduction: Data Mining Part One: Text Mining Part Two: Preprocessing Text Data

More information

1 [15 points] Frequent Itemsets Generation With Map-Reduce

1 [15 points] Frequent Itemsets Generation With Map-Reduce Data Mining Learning from Large Data Sets Final Exam Date: 15 August 2013 Time limit: 120 minutes Number of pages: 11 Maximum score: 100 points You can use the back of the pages if you run out of space.

More information

Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University

Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit

More information

Lecture 1 September 3, 2013

Lecture 1 September 3, 2013 CS 229r: Algorithms for Big Data Fall 2013 Prof. Jelani Nelson Lecture 1 September 3, 2013 Scribes: Andrew Wang and Andrew Liu 1 Course Logistics The problem sets can be found on the course website: http://people.seas.harvard.edu/~minilek/cs229r/index.html

More information

Lecture: Local Spectral Methods (1 of 4)

Lecture: Local Spectral Methods (1 of 4) Stat260/CS294: Spectral Graph Methods Lecture 18-03/31/2015 Lecture: Local Spectral Methods (1 of 4) Lecturer: Michael Mahoney Scribe: Michael Mahoney Warning: these notes are still very rough. They provide

More information

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from http://www.mmds.org Distance Measures For finding similar documents, we consider the Jaccard

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Week #1

Machine Learning for Large-Scale Data Analysis and Decision Making A. Week #1 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Week #1 Today Introduction to machine learning The course (syllabus) Math review (probability + linear algebra) The future

More information

Course Announcements. Bacon is due next Monday. Next lab is about drawing UIs. Today s lecture will help thinking about your DB interface.

Course Announcements. Bacon is due next Monday. Next lab is about drawing UIs. Today s lecture will help thinking about your DB interface. Course Announcements Bacon is due next Monday. Today s lecture will help thinking about your DB interface. Next lab is about drawing UIs. John Jannotti (cs32) ORMs Mar 9, 2017 1 / 24 ORMs John Jannotti

More information

Collaborative topic models: motivations cont

Collaborative topic models: motivations cont Collaborative topic models: motivations cont Two topics: machine learning social network analysis Two people: " boy Two articles: article A! girl article B Preferences: The boy likes A and B --- no problem.

More information

COMPUTER SCIENCE TRIPOS

COMPUTER SCIENCE TRIPOS CST.2016.2.1 COMPUTER SCIENCE TRIPOS Part IA Tuesday 31 May 2016 1.30 to 4.30 COMPUTER SCIENCE Paper 2 Answer one question from each of Sections A, B and C, and two questions from Section D. Submit the

More information

Lecture 26 December 1, 2015

Lecture 26 December 1, 2015 CS 229r: Algorithms for Big Data Fall 2015 Prof. Jelani Nelson Lecture 26 December 1, 2015 Scribe: Zezhou Liu 1 Overview Final project presentations on Thursday. Jelani has sent out emails about the logistics:

More information

One Optimized I/O Configuration per HPC Application

One Optimized I/O Configuration per HPC Application One Optimized I/O Configuration per HPC Application Leveraging I/O Configurability of Amazon EC2 Cloud Mingliang Liu, Jidong Zhai, Yan Zhai Tsinghua University Xiaosong Ma North Carolina State University

More information

The conceptual view. by Gerrit Muller University of Southeast Norway-NISE

The conceptual view. by Gerrit Muller University of Southeast Norway-NISE by Gerrit Muller University of Southeast Norway-NISE e-mail: gaudisite@gmail.com www.gaudisite.nl Abstract The purpose of the conceptual view is described. A number of methods or models is given to use

More information

Introduction to the Simplex Algorithm Active Learning Module 3

Introduction to the Simplex Algorithm Active Learning Module 3 Introduction to the Simplex Algorithm Active Learning Module 3 J. René Villalobos and Gary L. Hogg Arizona State University Paul M. Griffin Georgia Institute of Technology Background Material Almost any

More information

GeoSpark: A Cluster Computing Framework for Processing Spatial Data

GeoSpark: A Cluster Computing Framework for Processing Spatial Data GeoSpark: A Cluster Computing Framework for Processing Spatial Data Jia Yu School of Computing, Informatics, and Decision Systems Engineering, Arizona State University 699 S. Mill Avenue, Tempe, AZ jiayu2@asu.edu

More information

Algorithms for Data Science

Algorithms for Data Science Algorithms for Data Science CSOR W4246 Eleni Drinea Computer Science Department Columbia University Tuesday, December 1, 2015 Outline 1 Recap Balls and bins 2 On randomized algorithms 3 Saving space: hashing-based

More information

7.5 Operations with Matrices. Copyright Cengage Learning. All rights reserved.

7.5 Operations with Matrices. Copyright Cengage Learning. All rights reserved. 7.5 Operations with Matrices Copyright Cengage Learning. All rights reserved. What You Should Learn Decide whether two matrices are equal. Add and subtract matrices and multiply matrices by scalars. Multiply

More information

SPATIAL DATA MINING. Ms. S. Malathi, Lecturer in Computer Applications, KGiSL - IIM

SPATIAL DATA MINING. Ms. S. Malathi, Lecturer in Computer Applications, KGiSL - IIM SPATIAL DATA MINING Ms. S. Malathi, Lecturer in Computer Applications, KGiSL - IIM INTRODUCTION The main difference between data mining in relational DBS and in spatial DBS is that attributes of the neighbors

More information

CSC2515 Assignment #2

CSC2515 Assignment #2 CSC2515 Assignment #2 Due: Nov.4, 2pm at the START of class Worth: 18% Late assignments not accepted. 1 Pseudo-Bayesian Linear Regression (3%) In this question you will dabble in Bayesian statistics and

More information

ECS231: Spectral Partitioning. Based on Berkeley s CS267 lecture on graph partition

ECS231: Spectral Partitioning. Based on Berkeley s CS267 lecture on graph partition ECS231: Spectral Partitioning Based on Berkeley s CS267 lecture on graph partition 1 Definition of graph partitioning Given a graph G = (N, E, W N, W E ) N = nodes (or vertices), E = edges W N = node weights

More information

IR: Information Retrieval

IR: Information Retrieval / 44 IR: Information Retrieval FIB, Master in Innovation and Research in Informatics Slides by Marta Arias, José Luis Balcázar, Ramon Ferrer-i-Cancho, Ricard Gavaldá Department of Computer Science, UPC

More information

Feature Engineering, Model Evaluations

Feature Engineering, Model Evaluations Feature Engineering, Model Evaluations Giri Iyengar Cornell University gi43@cornell.edu Feb 5, 2018 Giri Iyengar (Cornell Tech) Feature Engineering Feb 5, 2018 1 / 35 Overview 1 ETL 2 Feature Engineering

More information

Frequent Itemsets and Association Rule Mining. Vinay Setty Slides credit:

Frequent Itemsets and Association Rule Mining. Vinay Setty Slides credit: Frequent Itemsets and Association Rule Mining Vinay Setty vinay.j.setty@uis.no Slides credit: http://www.mmds.org/ Association Rule Discovery Supermarket shelf management Market-basket model: Goal: Identify

More information

On Two Class-Constrained Versions of the Multiple Knapsack Problem

On Two Class-Constrained Versions of the Multiple Knapsack Problem On Two Class-Constrained Versions of the Multiple Knapsack Problem Hadas Shachnai Tami Tamir Department of Computer Science The Technion, Haifa 32000, Israel Abstract We study two variants of the classic

More information

Linear Algebra. Introduction. Marek Petrik 3/23/2017. Many slides adapted from Linear Algebra Lectures by Martin Scharlemann

Linear Algebra. Introduction. Marek Petrik 3/23/2017. Many slides adapted from Linear Algebra Lectures by Martin Scharlemann Linear Algebra Introduction Marek Petrik 3/23/2017 Many slides adapted from Linear Algebra Lectures by Martin Scharlemann Midterm Results Highest score on the non-r part: 67 / 77 Score scaling: Additive

More information

Lab 2 Worksheet. Problems. Problem 1: Geometry and Linear Equations

Lab 2 Worksheet. Problems. Problem 1: Geometry and Linear Equations Lab 2 Worksheet Problems Problem : Geometry and Linear Equations Linear algebra is, first and foremost, the study of systems of linear equations. You are going to encounter linear systems frequently in

More information

Chuck Cartledge, PhD. 21 January 2018

Chuck Cartledge, PhD. 21 January 2018 Big Data: Data Analysis Boot Camp Non-SQL and R Chuck Cartledge, PhD 21 January 2018 1/19 Table of contents (1 of 1) 1 Intro. 2 Non-SQL DBMS Classic Non-SQL databases 3 Hands-on Airport connections as

More information

LECTURES 4/5: SYSTEMS OF LINEAR EQUATIONS

LECTURES 4/5: SYSTEMS OF LINEAR EQUATIONS LECTURES 4/5: SYSTEMS OF LINEAR EQUATIONS MA1111: LINEAR ALGEBRA I, MICHAELMAS 2016 1 Linear equations We now switch gears to discuss the topic of solving linear equations, and more interestingly, systems

More information

Long-Short Term Memory and Other Gated RNNs

Long-Short Term Memory and Other Gated RNNs Long-Short Term Memory and Other Gated RNNs Sargur Srihari srihari@buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Sequence Modeling

More information

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University. Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University http://www.mmds.org #1: C4.5 Decision Tree - Classification (61 votes) #2: K-Means - Clustering

More information

THE STATE OF CONTEMPORARY COMPUTING SUBSTRATES FOR OPTIMIZATION METHODS. Benjamin Recht UC Berkeley

THE STATE OF CONTEMPORARY COMPUTING SUBSTRATES FOR OPTIMIZATION METHODS. Benjamin Recht UC Berkeley THE STATE OF CONTEMPORARY COMPUTING SUBSTRATES FOR OPTIMIZATION METHODS Benjamin Recht UC Berkeley MY QUIXOTIC QUEST FOR SUPERLINEAR ALGORITHMS Benjamin Recht UC Berkeley Collaborators Slides extracted

More information

Engineering for Compatibility

Engineering for Compatibility W17 Compatibility Testing Wednesday, October 3rd, 2018 3:00 PM Engineering for Compatibility Presented by: Melissa Benua mparticle Brought to you by: 350 Corporate Way, Suite 400, Orange Park, FL 32073

More information

NAVAL POSTGRADUATE SCHOOL

NAVAL POSTGRADUATE SCHOOL NAVAL POSTGRADUATE SCHOOL MONTEREY, CALIFORNIA THESIS DATA MINING OF EXTREMELY LARGE AD-HOC DATA SETS TO PRODUCE REVERSE WEB-LINK GRAPHS by Tao-hsiang Chang March 2017 Thesis Co-Advisors: Frank Kragh Geoffrey

More information

Administering your Enterprise Geodatabase using Python. Jill Penney

Administering your Enterprise Geodatabase using Python. Jill Penney Administering your Enterprise Geodatabase using Python Jill Penney Assumptions Basic knowledge of python Basic knowledge enterprise geodatabases and workflows You want code Please turn off or silence cell

More information

Introduction to ArcGIS Server Development

Introduction to ArcGIS Server Development Introduction to ArcGIS Server Development Kevin Deege,, Rob Burke, Kelly Hutchins, and Sathya Prasad ESRI Developer Summit 2008 1 Schedule Introduction to ArcGIS Server Rob and Kevin Questions Break 2:15

More information

Scalable 3D Spatial Queries for Analytical Pathology Imaging with MapReduce

Scalable 3D Spatial Queries for Analytical Pathology Imaging with MapReduce Scalable 3D Spatial Queries for Analytical Pathology Imaging with MapReduce Yanhui Liang, Stony Brook University Hoang Vo, Stony Brook University Ablimit Aji, Hewlett Packard Labs Jun Kong, Emory University

More information

QALGO workshop, Riga. 1 / 26. Quantum algorithms for linear algebra.

QALGO workshop, Riga. 1 / 26. Quantum algorithms for linear algebra. QALGO workshop, Riga. 1 / 26 Quantum algorithms for linear algebra., Center for Quantum Technologies and Nanyang Technological University, Singapore. September 22, 2015 QALGO workshop, Riga. 2 / 26 Overview

More information

LINK ANALYSIS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

LINK ANALYSIS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS LINK ANALYSIS Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Retrieval models Retrieval evaluation Link analysis Models

More information

CS60021: Scalable Data Mining. Dimensionality Reduction

CS60021: Scalable Data Mining. Dimensionality Reduction J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Dimensionality Reduction Sourangshu Bhattacharya Assumption: Data lies on or near a

More information

A New Lower Bound Technique for Quantum Circuits without Ancillæ

A New Lower Bound Technique for Quantum Circuits without Ancillæ A New Lower Bound Technique for Quantum Circuits without Ancillæ Debajyoti Bera Abstract We present a technique to derive depth lower bounds for quantum circuits. The technique is based on the observation

More information

Ad Placement Strategies

Ad Placement Strategies Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January

More information

CS 4604: Introduc0on to Database Management Systems. B. Aditya Prakash Lecture #3: SQL---Part 1

CS 4604: Introduc0on to Database Management Systems. B. Aditya Prakash Lecture #3: SQL---Part 1 CS 4604: Introduc0on to Database Management Systems B. Aditya Prakash Lecture #3: SQL---Part 1 Announcements---Project Goal: design a database system applica=on with a web front-end Project Assignment

More information