Knowledge Discovery and Data Mining 1 (VO) (707.003) Map-Reduce Denis Helic KTI, TU Graz Oct 24, 2013 Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 1 / 82

Big picture: KDDM Probability Theory Linear Algebra Map-Reduce Information Theory Statistical Inference Mathematical Tools Infrastructure Knowledge Discovery Process Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 2 / 82

Outline 1 Motivation 2 Large Scale Computation 3 Map-Reduce 4 Environment 5 Map-Reduce Skew Slides Slides are partially based on Mining Massive Datasets course from Stanford University by Jure Leskovec Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 3 / 82

Map-Reduce Motivation Today's data is huge Challenges How to distribute computation? Distributed/parallel programming is hard Map-reduce addresses both of these points Google's computational/data manipulation model Elegant way to work with huge data Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 4 / 82

Motivation Single node architecture (figure: a single machine with CPU, memory, and disk) Data fits in memory Machine learning, statistics Classical data mining Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 5 / 82

Motivation Motivation: Google example 20+ billion Web pages Approx. 20 KB per page Approx. 400+ TB for the whole Web Approx. 1000 hard drives to store the Web A single computer reads 30-35 MB/s from disk Approx. 4 months to read the Web with a single computer Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 6 / 82
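A quick back-of-envelope check of the last figure (a sketch; the 400 TB total size and the 30-35 MB/s read speed are the numbers assumed above):

```python
# Back-of-envelope check of the figures above (a sketch; the 400 TB total
# size and the 30-35 MB/s sequential read speed are the slide's assumptions).
web_size_bytes = 400e12                     # approx. 400 TB for the whole Web

for read_speed_mb_per_s in (30, 35):
    seconds = web_size_bytes / (read_speed_mb_per_s * 1e6)
    months = seconds / (60 * 60 * 24 * 30)  # 30-day months
    print(f"{read_speed_mb_per_s} MB/s -> ~{months:.1f} months to read the Web")
# roughly 4-5 months, in line with the slide's estimate
```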

Motivation Motivation: Google example Takes even more time to do something with the data E.g. to calculate the PageRank If m is the number of links on the Web Average degree on the Web is approx. 10, thus m ≈ 2 × 10^11 To calculate PageRank we need m multiplications per iteration step We need approx. 100+ iteration steps Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 7 / 82

Motivation Motivation: Google example Today a standard architecture for such problems is emerging Cluster of commodity Linux nodes Commodity network (Ethernet) to connect them 2-10 Gbps between racks 1 Gbps within racks Each rack contains 16-64 nodes Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 8 / 82

Cluster architecture Motivation Figure: Cluster architecture: racks of nodes (each with CPU, memory, and disk) connected by rack switches and a backbone switch Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 9 / 82

Motivation Motivation: Google example 2011 estimation: Google had approx. 1 million machines http://www.datacenterknowledge.com/archives/2011/08/01/report-google-uses-about-900000-servers/ Other examples: Facebook, Twitter, Amazon, etc. But also smaller examples: e.g. Wikipedia Single source shortest path: m + n time complexity, approx. 260 × 10^6 Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 10 / 82

Large Scale Computation Large scale computation Large scale computation for data mining on commodity hardware Challenges How to distribute computation? How can we make it easy to write distributed programs? How to cope with machine failures? Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 11 / 82

Large Scale Computation Large scale computation: machine failures One server may stay up 3 years (1000 days) The failure rate per day: p = 10^-3 How many failures per day if we have n machines? Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 12 / 82

Large Scale Computation Large scale computation: machine failures One server may stay up 3 years (1000 days) The failure rate per day: p = 10^-3 How many failures per day if we have n machines? Binomial r.v. Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 12 / 82

Large Scale Computation Large scale computation: machine failures PMF of a Binomial r.v. p(k) = (n choose k) p^k (1-p)^(n-k) Expectation of a Binomial r.v. E[X] = np Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 13 / 82

Large Scale Computation Large scale computation: machine failures n = 1000, E[X] = 1 If we have 1000 machines we lose one per day n = 1000000, E[X] = 1000 If we have 1 million machines (Google) we lose 1 thousand per day Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 14 / 82
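A minimal sketch of this calculation in Python (standard library only); the per-machine failure probability p = 10^-3 per day is the assumption from the slide:

```python
from math import comb

p = 1e-3                                   # per-machine failure probability per day

def binom_pmf(k, n, p):
    """P(exactly k failures among n machines), Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

for n in (1_000, 1_000_000):
    print(f"n = {n}: E[failures/day] = {n * p:g}, "
          f"P(no failure at all) = {binom_pmf(0, n, p):.3g}")
# n = 1000 gives E = 1 failure/day; n = 1,000,000 gives E = 1000 failures/day
```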

Large Scale Computation Coping with node failures: Exercise Exercise Suppose a job consists of n tasks, each of which takes time T seconds. Thus, if there are no failures, the sum over all compute nodes of the time taken to execute tasks at that node is nT. Suppose also that the probability of a task failing is p per job per second, and when a task fails, the overhead of management of the restart is such that it adds 10T seconds to the total execution time of the job. What is the total expected execution time of the job? Example Example 2.4.1 from Mining Massive Datasets. Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 15 / 82

Large Scale Computation Coping with node failures: Exercise Failure of a single task is a Bernoulli r.v. with parameter p The number of failures in n tasks is a Binomial r.v. with parameters n and p PMF of a Binomial r.v. p(k) = (n choose k) p^k (1-p)^(n-k) Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 16 / 82

Large Scale Computation Coping with node failures: Exercise The time until the first failure of a task is a Geometric r.v. with parameter p PMF p(k) = (1-p)^(k-1) p Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 17 / 82

Large Scale Computation Coping with node failures: Exercise If we go to very fine scales we can approximate a Geometric r.v. with an Exponential r.v. with λ = p The time T until a task fails is distributed exponentially PDF f(t; λ) = λ e^(-λt) for t ≥ 0, and 0 for t < 0 CDF F(t; λ) = 1 - e^(-λt) for t ≥ 0, and 0 for t < 0 Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 18 / 82
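This approximation can be checked numerically; a small sketch (the value of p is an arbitrary illustration):

```python
from math import exp

p = 1e-3          # per-second failure probability (illustrative value only)
lam = p           # rate of the approximating exponential r.v.

for k in (100, 1_000, 5_000):
    geometric_tail = (1 - p) ** k      # P(no failure in the first k seconds)
    exponential_tail = exp(-lam * k)   # P(T > k) for T ~ Exponential(lambda)
    print(f"k = {k:>5}: geometric {geometric_tail:.4f}, "
          f"exponential {exponential_tail:.4f}")
# the two tails agree closely, which justifies the approximation for small p
```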

Large Scale Computation Coping with node failures: Exercise Expected execution time of a task Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 19 / 82

Large Scale Computation Coping with node failures: Exercise Expected execution time of a task E[T] = P_S T_S + P_F (E[T_L] + T_R + E[T]) where T_S is the task's execution time if it succeeds, T_L is the time lost to a failure, and T_R is the restart overhead P_S = 1 - P_F P_F = 1 - e^(-λ T_S) P_S = e^(-λ T_S) Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 19 / 82

Large Scale Computation Coping with node failures: Exercise After simplifying, we get: E[T] = T_S + (P_F / (1 - P_F)) (E[T_L] + T_R) P_F / (1 - P_F) = (1 - e^(-λ T_S)) / e^(-λ T_S) = e^(λ T_S) - 1 (1) E[T] = T_S + (e^(λ T_S) - 1)(E[T_L] + T_R) Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 20 / 82

Large Scale Computation Coping with node failures: Exercise E[T_L] = ? What is the PDF of T_L? T_L models the time lost because of a failure Time is lost if and only if a failure occurs before the task finishes, i.e. we know that within [0, T_S] a failure has occurred Let this be an event B Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 21 / 82

Large Scale Computation Coping with node failures: Exercise What is P(B)? P(B) = F(T_S) = 1 - e^(-λ T_S) Our T_L is now a r.v. conditioned on event B I.e. we are interested in the probability of event A (failure occurs at time t < T_S) given that B occurred P(A) = λ e^(-λt) Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 22 / 82

Large Scale Computation Coping with node failures: Exercise P(A|B) = P(A ∩ B) / P(B) What is A ∩ B? A: failure occurs at time t < T_S B: failure occurs within [0, T_S] A ∩ B: failure occurs at time t < T_S A ∩ B = A Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 23 / 82

Large Scale Computation Coping with node failures: Exercise P(A|B) = P(A) / P(B) PDF of the r.v. T_L f(t) = λ e^(-λt) / (1 - e^(-λ T_S)) Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 24 / 82

Large Scale Computation Coping with node failures: Exercise Expectation E[T_L]: E[T_L] = ∫ t f(t) dt = ∫_0^{T_S} t λ e^(-λt) / (1 - e^(-λ T_S)) dt = (1 / (1 - e^(-λ T_S))) ∫_0^{T_S} t λ e^(-λt) dt (2) Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 25 / 82

Large Scale Computation Coping with node failures: Exercise E[T_L] = (1 / (1 - e^(-λ T_S))) [ -t e^(-λt) - e^(-λt)/λ ]_0^{T_S} = 1/λ - T_S / (e^(λ T_S) - 1) E[T] = T_S + (e^(λ T_S) - 1)(1/λ - T_S / (e^(λ T_S) - 1) + T_R) = (e^(λ T_S) - 1)(1/λ + T_R) (3) Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 26 / 82

Large Scale Computation Coping with node failures: Exercise For a single task: E[T] = (e^(pT) - 1)(1/p + 10T) For n tasks: E[T] = n (e^(pT) - 1)(1/p + 10T) Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 27 / 82
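A small sketch that evaluates this closed form and cross-checks it against a Monte Carlo simulation of the fail-and-restart process; the numeric values of n, T and p are arbitrary illustrations, not from the slides:

```python
import math, random

def expected_job_time(n, T, p):
    """Closed form n*(e^{pT} - 1)*(1/p + 10T) derived above."""
    return n * (math.exp(p * T) - 1) * (1 / p + 10 * T)

def simulate_task(T, p, rng):
    """One task: an attempt fails after an Exp(p)-distributed time if that
    time is < T; a failure costs the lost time plus a 10T restart overhead."""
    total = 0.0
    while True:
        time_to_failure = rng.expovariate(p)
        if time_to_failure >= T:
            return total + T               # this attempt finished
        total += time_to_failure + 10 * T  # lost time + restart overhead

rng = random.Random(0)
n, T, p = 100, 60.0, 1e-3                  # illustrative values
analytic = expected_job_time(n, T, p)
runs = 200
simulated = sum(simulate_task(T, p, rng) for _ in range(n * runs)) / runs
print(f"analytic ~ {analytic:.0f} s, simulated ~ {simulated:.0f} s")
```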

Large Scale Computation Coordination Using this information we can improve scheduling We can also optimize checking for node failures Check-pointing strategies Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 28 / 82

Large Scale Computation Large scale computation: data copying Copying data over network takes time Bring data closer to computation I.e. process data locally at each node Replicate data to increase reliability Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 29 / 82

Solution: Map-reduce Map-Reduce Storage infrastructure Distributed file system Google: GFS Hadoop: HDFS Programming model Map-reduce Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 30 / 82

Storage infrastructure Map-Reduce Problem: if a node fails, how do we store a file persistently? Distributed file system Provides global file namespace Typical usage pattern Huge files: several 100s of GB to 1 TB Data is rarely updated in place Reads and appends are common Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 31 / 82

Distributed file system Map-Reduce Chunk servers File is split into contiguous chunks Typically each chunk is 16-64 MB Each chunk is replicated (usually 2x or 3x) Try to keep replicas in different racks Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 32 / 82

Distributed file system Map-Reduce Master node Stores metadata about where files are stored Might be replicated Client library for file access Talks to master node to find chunk servers Connects directly to chunk servers to access data Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 33 / 82

Distributed file system Map-Reduce Reliable distributed file system Data kept in chunks spread across machines Each chunk replicated on different machines Seamless recovery from disk or machine failure Bring computation directly to the data: chunk servers also serve as compute servers Figure: Chunks (C0-C5, D0, D1) replicated across chunk servers 1 to N; figure from slides by Jure Leskovec, CS246: Mining Massive Datasets, http://cs246.stanford.edu Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 34 / 82

Map-Reduce Programming model: Map-reduce Running example We want to count the number of occurrences for each word in a collection of documents. In this example, the input file is a repository of documents, and each document is an element. Example This has meanwhile become the standard Map-reduce example. Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 35 / 82

Map-Reduce Programming model: Map-reduce words(input_file) | sort | uniq -c Three-step process 1 Split the file into words, each word on a separate line 2 Group and sort all words 3 Count the occurrences Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 36 / 82

Map-Reduce Programming model: Map-reduce This captures the essence of Map-reduce Split Group Count Naturally parallelizable E.g. split and count Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 37 / 82

Map-Reduce Programming model: Map-reduce Sequentially read a lot of data Map: extract something that you care about (key, value) Group by key: sort and shuffle Reduce: Aggregate, summarize, filter or transform Write the result Outline Outline is always the same: Map and Reduce change to fit the problem Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 38 / 82

Map-Reduce Programming model: Map-reduce All key-value pairs with the same key wind up at the same Reduce task; the Reduce tasks work on one key at a time and combine all the values associated with that key in the way defined by the user-written Reduce function. Figure: Schematic of a map-reduce computation (Figure 2.2 from the book Mining Massive Datasets): input chunks → Map tasks → key-value pairs (k, v) → group by keys → keys with all their values (k, [v, w, ...]) → Reduce tasks → combined output Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 39 / 82

Map-Reduce Programming model: Map-reduce Input: a set of (key, value) pairs (e.g. key is the filename, value is a single line in the file) Map(k, v) → ⟨(k', v')⟩ Takes a (k, v) pair and outputs a set of (k', v') pairs There is one Map call for each (k, v) pair Reduce(k', ⟨v'⟩) → (k', v'') All values v' with the same key k' are reduced together and processed in v' order There is one Reduce call for each unique k' Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 40 / 82

Map-Reduce Programming model: Map-reduce Big document: "Star Wars is an American epic space opera franchise centered on a film series created by George Lucas. The film series has spawned an extensive media franchise called the Expanded Universe including books, television series, computer and video games, and comic books. These supplements to the two film trilogies..." Map: (Star, 1) (Wars, 1) (is, 1) (an, 1) (American, 1) (epic, 1) (space, 1) (opera, 1) (franchise, 1) (centered, 1) (on, 1) (a, 1) (film, 1) (series, 1) (created, 1) (by, 1) ... Group by key: (Star, 1) (Star, 1) (Wars, 1) (Wars, 1) (a, 1) (a, 1) (a, 1) (a, 1) (a, 1) (a, 1) (film, 1) (film, 1) (film, 1) (franchise, 1) (series, 1) (series, 1) ... Reduce: (Star, 2) (Wars, 2) (a, 6) (film, 3) (franchise, 1) (series, 2) ... Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 41 / 82

Map-Reduce Programming model: Map-reduce

map(key, value):
    // key: document name
    // value: a single line from a document
    foreach word w in value:
        emit(w, 1)

Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 42 / 82

Map-Reduce Programming model: Map-reduce

reduce(key, values):
    // key: a word
    // values: an iterator over counts
    result = 0
    foreach count c in values:
        result += c
    emit(key, result)

Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 43 / 82
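A runnable Python sketch of the whole pipeline, mirroring the map and reduce pseudocode above; emit is modeled by yielding/returning pairs, and the group-by-key step is done with a dictionary (the documents are made up for illustration):

```python
from collections import defaultdict

def map_fn(key, value):
    """key: document name, value: the document text (one line)."""
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    """key: a word, values: an iterable of counts."""
    return (key, sum(values))

documents = {"doc1": "star wars is a film series",
             "doc2": "the film series is a franchise"}

# Map phase: one Map call per (key, value) pair
intermediate = [pair for name, text in documents.items()
                for pair in map_fn(name, text)]

# Group by key (the shuffle/sort step)
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# Reduce phase: one Reduce call per unique key
result = dict(reduce_fn(word, counts) for word, counts in groups.items())
print(result)  # {'star': 1, 'wars': 1, 'is': 2, 'a': 2, 'film': 2, ...}
```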

Map-reduce computation Environment Map-reduce environment takes care of: 1 Partitioning the input data 2 Scheduling the program's execution across a set of machines 3 Performing the group by key step 4 Handling machine failures 5 Managing required inter-machine communication Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 44 / 82

Map-reduce computation Environment Map: read input and produce a set of key-value pairs Group by key: collect all pairs with the same key (hash merge, shuffle, sort, partition) Reduce: collect all values belonging to the key and output the result Figure: Figure from the course by Jure Leskovec (Stanford University, CS246: Mining Massive Datasets, http://cs246.stanford.edu) Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 45 / 82

Map-reduce computation Environment All phases are distributed, with many tasks doing the work Figure: Figure from the course by Jure Leskovec (Stanford University, CS246: Mining Massive Datasets, http://cs246.stanford.edu) Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 46 / 82

Environment Data flow Input and final output are stored in distributed file system Scheduler tries to schedule map tasks close to physical storage location of input data Intermediate results are stored on local file systems of Map and Reduce workers Output is often input to another Map-reduce computation Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 47 / 82

Environment Coordination: Master Master node takes care of coordination Task status, e.g. idle, in-progress, completed Idle tasks get scheduled as workers become available When a Map task completes, it notifies the master about the size and location of its intermediate files Master pushes this info to reducers Master pings workers periodically to detect failures Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 48 / 82

Environment Map-reduce execution details Figure: Overview of the execution of a map-reduce program (Figure 2.3 from the book Mining Massive Datasets): the User Program forks a Master and Worker processes; the Master assigns Map and Reduce tasks to Workers; each Map task is assigned one or more chunks of the input data and creates an intermediate file for each Reduce task on the local disk of its Worker; Reduce tasks read those intermediate files and write the output files; a Worker reports to the Master when it finishes a task and is then scheduled a new one Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 49 / 82

Maximizing parallelism Map-Reduce Skew If we want maximum parallelism then Use one Reduce task for each reducer (i.e. a single key and its associated value list) Execute each Reduce task at a different compute node The plan is typically not the best one Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 50 / 82

Map-Reduce Skew Maximizing parallelism There is overhead associated with each task we create We might want to keep the number of Reduce tasks lower than the number of different keys We do not want to create a task for a key with a short list There are often far more keys than there are compute nodes E.g. count words from Wikipedia or from the Web Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 51 / 82

Map-Reduce Skew Input data skew: Exercise Exercise Suppose we execute the word-count map-reduce program on a large repository such as a copy of the Web. We shall use 100 Map tasks and some number of Reduce tasks. 1 Do you expect there to be significant skew in the times taken by the various reducers to process their value list? Why or why not? 2 If we combine the reducers into a small number of Reduce tasks, say 10 tasks, at random, do you expect the skew to be significant? What if we instead combine the reducers into 10,000 Reduce tasks? Example Example is based on the example 2.2.1 from Mining Massive Datasets. Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 52 / 82

Maximizing parallelism Map-Reduce Skew There is often significant variation in the lengths of value list for different keys Different reducers take different amounts of time to finish If we make each reducer a separate Reduce task then the task execution times will exhibit significant variance Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 53 / 82

Map-Reduce Skew Input data skew Input data skew describes an uneven distribution of the number of values per key Examples include power-law graphs, e.g. the Web or Wikipedia Other data with Zipfian distribution E.g. the number of word occurrences Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 54 / 82

Map-Reduce Skew Power-law (Zipf) random variable PMF p(k) = k^(-α) / ζ(α), k ∈ N, k ≥ 1, α > 1 ζ(α) is the Riemann zeta function ζ(α) = Σ_{k=1}^∞ k^(-α) Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 55 / 82
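A short sketch that evaluates this PMF; the zeta function is approximated by a truncated sum, which is adequate for α > 1:

```python
def zeta(alpha, terms=100_000):
    """Truncated Riemann zeta function (an approximation, fine for alpha > 1)."""
    return sum(k ** -alpha for k in range(1, terms + 1))

def zipf_pmf(k, alpha):
    """p(k) = k^-alpha / zeta(alpha) for k = 1, 2, ..."""
    return k ** -alpha / zeta(alpha)

for alpha in (2.0, 3.0):
    probs = [round(zipf_pmf(k, alpha), 3) for k in range(1, 6)]
    print(f"alpha = {alpha}: p(1..5) = {probs}")
# alpha = 2.0: p(1) is about 0.61, p(2) about 0.15, ... (heavy tail)
```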

Map-Reduce Skew Power-law (Zipf) random variable Figure: Probability mass function of a Zipf random variable for differing α values (α = 2.0 and α = 3.0) Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 56 / 82

Map-Reduce Skew Power-law (Zipf) input data Figure: Key size distribution (log scale): sum = 189681, µ = 1.897, σ² = 58.853 Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 57 / 82

Map-Reduce Skew Tackling input data skew We need to distribute skewed (power-law) input data over a number of reducers/Reduce tasks/compute nodes The distribution of the key lengths inside of reducers/Reduce tasks/compute nodes should be approximately normal The variance of these distributions should be smaller than the original variance If the variance is small, efficient load balancing is possible Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 58 / 82

Map-Reduce Skew Tackling input data skew Each Reduce task receives a number of keys The total number of values to process is the sum of the number of values over all keys The average number of values that a Reduce task processes is the average of the number of values over all keys Equivalently, each compute node receives a number of Reduce tasks The sum and average for a compute node is the sum and average over all Reduce tasks for that node Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 59 / 82

Map-Reduce Skew Tackling input data skew How should we distribute keys to Reduce tasks? Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 60 / 82

Map-Reduce Skew Tackling input data skew How should we distribute keys to Reduce tasks? Uniformly at random Other possibilities? Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 60 / 82

Map-Reduce Skew Tackling input data skew How should we distribute keys to Reduce tasks? Uniformly at random Other possibilities? Calculate the capacity of a single Reduce task Add keys until capacity is reached, etc. Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 60 / 82

Map-Reduce Skew Tackling input data skew We are averaging over a skewed distribution Are there laws that describe how the averages of sufficiently large samples drawn from a probability distribution behave? In other words, how are the averages of samples of a r.v. distributed? Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 61 / 82

Map-Reduce Skew Tackling input data skew We are averaging over a skewed distribution Are there laws that describe how the averages of sufficiently large samples drawn from a probability distribution behave? In other words, how are the averages of samples of a r.v. distributed? Central-limit Theorem Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 61 / 82

Map-Reduce Skew Central-Limit Theorem The central-limit theorem describes the distribution of the arithmetic mean of sufficiently large samples of independent and identically distributed random variables The means are normally distributed The mean of the new distribution equals the mean of the original distribution The variance of the new distribution equals σ²/n, where σ² is the variance of the original distribution Thus, we keep the mean and reduce the variance Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 62 / 82

Central-Limit Theorem Map-Reduce Skew Theorem Suppose X_1, ..., X_n are independent and identically distributed r.v. with expectation µ and variance σ². Let Y_n be the r.v. defined as: Y_n = (1/n) Σ_{i=1}^n X_i Then, as n → ∞, the distribution of the standardized mean √n (Y_n - µ) / σ tends to the standard normal distribution; equivalently, for large n, Y_n is approximately normal with mean µ and variance σ²/n, with density f(y) ≈ (1 / √(2π σ²/n)) e^(-(y-µ)² / (2σ²/n)) Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 63 / 82

Map-Reduce Skew Central-Limit Theorem Example Practically, it is possible to replace F_n(y) with a normal distribution for n > 30 We should always average over at least 30 values Approximating a uniform r.v. with a normal r.v. by sampling and averaging Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 64 / 82
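A minimal simulation of this effect (uniform samples averaged in groups of n; the group sizes and number of repetitions are arbitrary choices):

```python
import random
import statistics

rng = random.Random(42)

def sample_means(n, num_means=10_000):
    """Average num_means groups of n i.i.d. Uniform(0, 1) samples."""
    return [sum(rng.random() for _ in range(n)) / n for _ in range(num_means)]

for n in (1, 5, 30):
    means = sample_means(n)
    print(f"n = {n:>2}: mean ~ {statistics.mean(means):.3f}, "
          f"variance ~ {statistics.variance(means):.5f} "
          f"(theory sigma^2/n = {1 / 12 / n:.5f})")
# the mean stays at 0.5 while the variance shrinks like sigma^2 / n = 1/(12 n)
```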

Central-Limit Theorem Map-Reduce Skew Figure: Histogram of averages of uniform samples: µ = 0.5, σ² = 0.08333 Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 65 / 82

Central-Limit Theorem Map-Reduce Skew Figure: Histogram of averages of uniform samples: µ = 0.499, σ² = 0.00270 Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 66 / 82

Map-Reduce Skew Central-Limit Theorem IPython Notebook examples http://kti.tugraz.at/staff/denis/courses/kddm1/clt.ipynb Command line ipython notebook --pylab=inline clt.ipynb Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 67 / 82

Map-Reduce Skew Input data skew We can reduce impact of the skew by using fewer Reduce tasks than there are reducers If keys are sent randomly to Reduce tasks we average over value list lengths Thus, we average over the total time for each Reduce task (Central-limit Theorem) We should make sure that the sample size is large enough (n > 30) Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 68 / 82

Map-Reduce Skew Input data skew We can further reduce the skew by using more Reduce tasks than there are compute nodes Long Reduce tasks might occupy a compute node fully Several shorter Reduce tasks are executed sequentially at a single compute node Thus, we average over the total time for each compute node (Central-limit Theorem) We should make sure that the sample size is large enough (n > 30) Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 69 / 82
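A sketch of this load-balancing argument: heavy-tailed value-list lengths are assigned uniformly at random to Reduce tasks, and the per-task averages vary far less than the per-key lengths (the Pareto-distributed key sizes are an illustrative stand-in for the Zipf data above):

```python
import random
import statistics

rng = random.Random(1)

# Illustrative skewed input: value-list lengths for 100,000 keys drawn from a
# heavy-tailed Pareto distribution (a stand-in for the Zipf data above).
num_keys = 100_000
key_sizes = [int(rng.paretovariate(2.5)) for _ in range(num_keys)]

def per_task_averages(num_tasks):
    """Send each key to a uniformly random Reduce task and return the
    average value-list length each task ends up with."""
    totals = [0] * num_tasks
    counts = [0] * num_tasks
    for size in key_sizes:
        t = rng.randrange(num_tasks)
        totals[t] += size
        counts[t] += 1
    return [tot / cnt for tot, cnt in zip(totals, counts) if cnt]

print("per-key lengths: variance", round(statistics.variance(key_sizes), 2))
for num_tasks in (10_000, 1_000, 100):
    avgs = per_task_averages(num_tasks)
    print(f"{num_tasks:>6} Reduce tasks: mean {statistics.mean(avgs):.2f}, "
          f"variance {statistics.variance(avgs):.4f}")
# fewer tasks -> each average is over more keys -> much smaller variance (CLT)
```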

Input data skew Map-Reduce Skew Figure: Key size distribution (log scale): sum = 196524, µ = 1.965, σ² = 243.245 Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 70 / 82

Input data skew Map-Reduce Skew Figure: Task key size distribution: sum = 196524, µ = 196.524, σ² = 25136.428 Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 71 / 82

Input data skew Map-Reduce Skew Figure: Task key averages (log scale): µ = 1.958, σ² = 1.886 Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 72 / 82

Input data skew Map-Reduce Skew Figure: Node key sizes (sum = 196524, µ = 19652.400, σ² = 242116.267) and node key averages (µ = 1.976, σ² = 0.030) Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 73 / 82

Map-Reduce Skew Input data skew IPython Notebook examples http://kti.tugraz.at/staff/denis/courses/kddm1/mrskew.ipynb Command line ipython notebook --pylab=inline mrskew.ipynb Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 74 / 82

Map-Reduce Skew Combiners Sometimes a Reduce function is associative and commutative Commutative: x ⊕ y = y ⊕ x Associative: (x ⊕ y) ⊕ z = x ⊕ (y ⊕ z) The values can be combined in any order, with the same result The addition in the reducer of the word count example is such an operation Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 75 / 82

Map-Reduce Skew Combiners When the Reduce function is associative and commutative we can push some of the reducers work to the Map tasks E.g. instead of emitting (w, 1), (w, 1),... We can apply the Reduce function within the Map task In that way the output of the Map task is combined before grouping and sorting Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 76 / 82
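A minimal sketch of this idea for the word-count example: the Map task applies the (associative and commutative) addition to its own output before anything is shuffled. The function name follows the earlier sketch, not any particular framework API:

```python
from collections import Counter

def map_with_combiner(key, value):
    """Word-count Map task with a built-in combiner: emit (word, local_count)
    instead of one (word, 1) pair per occurrence. This is valid because the
    Reduce operation (addition) is associative and commutative."""
    local_counts = Counter(value.split())   # combine within the Map task
    return list(local_counts.items())

print(map_with_combiner("doc1", "a film series is a film franchise"))
# [('a', 2), ('film', 2), ('series', 1), ('is', 1), ('franchise', 1)]
```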

Combiners Map-Reduce Skew Figure: Combiners in the word-count example: each Map task's output, e.g. (Star, 1), (Wars, 1), (is, 1), ..., is first combined locally, e.g. into (Star, 2), (Wars, 2), (a, 6), (a, 3), (a, 4), (film, 3), (franchise, 1), (series, 2), ...; after grouping by key, the Reduce tasks produce the final counts, e.g. (Star, 2), (Wars, 2), (a, 13), (film, 3), (franchise, 1), (series, 2), ... Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 77 / 82

Map-Reduce Skew Input data skew: Exercise Exercise Suppose we execute the word-count map-reduce program on a large repository such as a copy of the Web. We shall use 100 Map tasks and some number of Reduce tasks. 1 Do you expect there to be significant skew in the times taken by the various reducers to process their value list? Why or why not? 2 If we combine the reducers into a small number of Reduce tasks, say 10 tasks, at random, do you expect the skew to be significant? What if we instead combine the reducers into 10,000 Reduce tasks? 3 Suppose we do use a combiner at the 100 Map tasks. Do you expect skew to be significant? Why or why not? Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 78 / 82

Input data skew Map-Reduce Skew Figure: Key size distribution (log scale): sum = 195279, µ = 1.953, σ² = 83.105 Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 79 / 82

Input data skew Map-Reduce Skew Figure: Task key averages (log scale): µ = 1.793, σ² = 10.986 Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 80 / 82

Map-Reduce Skew Input data skew IPython Notebook examples http://kti.tugraz.at/staff/denis/courses/kddm1/combiner.ipynb Command line ipython notebook --pylab=inline combiner.ipynb Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 81 / 82

Map-Reduce: Exercise Map-Reduce Skew Exercise Suppose we have an n × n matrix M, whose element in row i and column j is denoted m_ij. Suppose we also have a vector v of length n, whose jth element is v_j. Then the matrix-vector product is the vector x of length n, whose ith element is given by x_i = Σ_{j=1}^n m_ij v_j Outline a Map-Reduce program that calculates the vector x. Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 82 / 82
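One possible outline, as a small Python sketch rather than a definitive solution, under the simplifying assumption that the whole vector v fits in memory at every Map task: Map emits (i, m_ij · v_j) for each matrix entry, and Reduce sums the partial products for each row index i.

```python
from collections import defaultdict

def map_fn(i, j, m_ij, v):
    """Emit the partial product for matrix entry (i, j); assumes the whole
    vector v is available at every Map task."""
    yield (i, m_ij * v[j])

def reduce_fn(i, partial_products):
    """x_i is the sum of all partial products m_ij * v_j for row i."""
    return (i, sum(partial_products))

# Tiny illustration: M as (i, j, value) triples, v as a plain list.
M = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
v = [1.0, 1.0]

groups = defaultdict(list)
for i, j, m_ij in M:
    for key, val in map_fn(i, j, m_ij, v):
        groups[key].append(val)

x = dict(reduce_fn(i, vals) for i, vals in groups.items())
print(x)  # {0: 3.0, 1: 7.0}
```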