Knowledge Discovery and Data Mining 1 (VO) (707.003)


1 Knowledge Discovery and Data Mining 1 (VO) (707.003): Map-Reduce
Denis Helic, KTI, TU Graz, Oct 24, 2013

2 Big picture: KDDM
Mathematical tools: Probability Theory, Linear Algebra, Information Theory, Statistical Inference
Infrastructure: Map-Reduce
Knowledge Discovery Process

3 Outline
1 Motivation
2 Large Scale Computation
3 Map-Reduce
4 Environment
5 Map-Reduce Skew
Slides are partially based on the Mining Massive Datasets course from Stanford University by Jure Leskovec.

4 Map-Reduce: Motivation
Today's data is huge.
Challenges: How to distribute computation? Distributed/parallel programming is hard.
Map-Reduce addresses both of these points: it is Google's computational/data-manipulation model and an elegant way to work with huge data.

5 Motivation: Single node architecture
A single node: CPU, memory, disk.
Data fits in memory: machine learning, statistics.
Data on disk: classical data mining.

6 Motivation: Google example
20+ billion Web pages, approx. 20 KB per page
Approx. 400+ TB for the whole Web
Approx. 1,000 hard drives just to store the Web
A single computer reads approx. 30-35 MB/s from disk
Approx. 4 months to read the Web with a single computer
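A back-of-the-envelope check of these numbers, as a minimal Python sketch (the 35 MB/s single-disk read rate is an assumed, era-typical value):

    pages = 20e9                # 20+ billion Web pages
    page_size = 20e3            # approx. 20 KB per page
    web_size = pages * page_size
    print(web_size / 1e12, "TB")             # -> 400.0 TB

    read_rate = 35e6            # assumed single-disk read rate of ~35 MB/s
    seconds = web_size / read_rate
    print(seconds / (86400 * 30), "months")  # -> approx. 4.4 months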

7 Motivation: Google example
It takes even more time to do something with the data, e.g. to calculate PageRank.
Let m be the number of links on the Web. The average degree on the Web is approx. 10, thus m is approx. 2 x 10^11.
To calculate PageRank we need m multiplications per iteration step, and on the order of 50 iteration steps until convergence.

8 Motivation: Google example
Today a standard architecture for such problems is emerging: a cluster of commodity Linux nodes connected by a commodity network (Ethernet).
2-10 Gbps between racks, 1 Gbps within racks.
Each rack contains 16-64 nodes.

9 Motivation: Cluster architecture
Figure: Racks of commodity nodes (each with CPU, memory, and disk) connected by a hierarchy of switches.

10 Motivation: Google example
2011 estimation: Google had approx. 1 million machines (http://www.datacenterknowledge.com/archives/2011/08/01/report-google-uses-about-900000-servers/)
Other examples: Facebook, Twitter, Amazon, etc.
But also smaller examples, e.g. Wikipedia: single-source shortest path has O(m + n) time complexity.

11 Large Scale Computation
Large scale computation for data mining on commodity hardware.
Challenges: How to distribute computation? How can we make it easy to write distributed programs? How to cope with machine failures?

12 Large Scale Computation: machine failures
One server may stay up 3 years (approx. 1000 days), so the failure rate per day is p = 10^{-3}.
How many failures per day if we have n machines?

13 Large Scale Computation: machine failures
One server may stay up 3 years (approx. 1000 days), so the failure rate per day is p = 10^{-3}.
How many failures per day if we have n machines? The number of failures is a Binomial r.v.

14 Large Scale Computation: machine failures
PMF of a Binomial r.v.:
p(k) = \binom{n}{k} p^k (1 - p)^{n-k}
Expectation of a Binomial r.v.: E[X] = np

15 Large Scale Computation: machine failures
n = 1000: E[X] = 1. If we have 1000 machines we lose one per day.
n = 10^6: E[X] = 1000. If we have 1 million machines (Google) we lose a thousand per day.
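A minimal sketch checking these expectations with scipy's binomial distribution (assumes scipy is installed):

    from scipy.stats import binom

    p = 1e-3                     # failure probability per machine per day
    for n in (1_000, 1_000_000):
        rv = binom(n, p)
        # E[X] = n*p failures per day; also show P(at least one failure)
        print(n, rv.mean(), 1 - rv.pmf(0))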

16 Large Scale Computation: Coping with node failures: Exercise
Exercise: Suppose a job consists of n tasks, each of which takes time T seconds. Thus, if there are no failures, the sum over all compute nodes of the time taken to execute tasks at those nodes is nT. Suppose also that the probability of a task failing is p per job per second, and when a task fails, the overhead of managing the restart is such that it adds 10T seconds to the total execution time of the job. What is the total expected execution time of the job?
(Example from Mining Massive Datasets.)

17 Large Scale Computation: Coping with node failures: Exercise
Failure of a single task is a Bernoulli r.v. with parameter p.
The number of failures in n tasks is a Binomial r.v. with parameters n and p.
PMF of a Binomial r.v.:
p(k) = \binom{n}{k} p^k (1 - p)^{n-k}

18 Large Scale Computation: Coping with node failures: Exercise
The time until the first failure of a task is a Geometric r.v. with parameter p.
PMF: p(k) = (1 - p)^{k-1} p

19 Large Scale Computation: Coping with node failures: Exercise
If we go to very fine time scales, we can approximate a Geometric r.v. with an Exponential r.v. with λ = p.
The time T until a task fails is distributed exponentially.
PDF: f(t; λ) = λ e^{-λt} for t ≥ 0, and 0 for t < 0
CDF: F(t; λ) = 1 - e^{-λt} for t ≥ 0, and 0 for t < 0

20 Large Scale Computation: Coping with node failures: Exercise
Expected execution time of a task

21 Large Scale Computation: Coping with node failures: Exercise
Expected execution time of a task, where T_S is the time the task needs to finish, T_R the restart overhead, and T_L the time lost to a failure:
E[T] = P_S T_S + P_F (E[T_L] + T_R + E[T])
P_S = 1 - P_F
P_F = 1 - e^{-λ T_S}, thus P_S = e^{-λ T_S}

22 Large Scale Computation: Coping with node failures: Exercise
After simplifying, we get:
E[T] = T_S + (P_F / (1 - P_F)) (E[T_L] + T_R)
P_F / (1 - P_F) = (1 - e^{-λ T_S}) / e^{-λ T_S} = e^{λ T_S} - 1    (1)
E[T] = T_S + (e^{λ T_S} - 1)(E[T_L] + T_R)

23 Large Scale Computation: Coping with node failures: Exercise
E[T_L] = ? What is the PDF of T_L?
T_L models the time lost because of a failure.
Time is lost if and only if a failure occurs before the task finishes, i.e. we know that within [0, T_S] a failure has occurred. Let this be an event B.

24 Large Scale Computation: Coping with node failures: Exercise
What is P(B)? P(B) = F(T_S) = 1 - e^{-λ T_S}
Our T_L is now a r.v. conditioned on event B, i.e. we are interested in the probability of event A (failure occurs at time t < T_S) given that B occurred.
P(A) = λ e^{-λt}

25 Large Scale Computation: Coping with node failures: Exercise
P(A | B) = P(A ∩ B) / P(B)
What is A ∩ B? A: failure occurs at time t < T_S; B: failure occurs within [0, T_S].
A ∩ B: failure occurs at time t < T_S, so A ∩ B = A.

26 Large Scale Computation: Coping with node failures: Exercise
P(A | B) = P(A) / P(B)
PDF of the r.v. T_L:
f(t) = λ e^{-λt} / (1 - e^{-λ T_S}) for 0 ≤ t ≤ T_S

27 Large Scale Computation: Coping with node failures: Exercise
Expectation E[T_L]:
E[T_L] = ∫ t f(t) dt = ∫_0^{T_S} t λ e^{-λt} / (1 - e^{-λ T_S}) dt = (1 / (1 - e^{-λ T_S})) ∫_0^{T_S} t λ e^{-λt} dt    (2)

28 Large Scale Computation: Coping with node failures: Exercise
E[T_L] = (1 / (1 - e^{-λ T_S})) [ -t e^{-λt} - e^{-λt}/λ ]_0^{T_S} = 1/λ - T_S / (e^{λ T_S} - 1)
E[T] = T_S + (e^{λ T_S} - 1)(1/λ - T_S / (e^{λ T_S} - 1) + T_R) = (e^{λ T_S} - 1)(1/λ + T_R)    (3)
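The integration and simplification steps can be verified symbolically; a minimal sympy sketch (assumes sympy is installed):

    import sympy as sp

    t, Ts, lam = sp.symbols('t T_S lambda', positive=True)
    # E[T_L] = (1 / (1 - e^{-lam*T_S})) * Integral(t * lam * e^{-lam*t}, 0..T_S)
    ETL = sp.integrate(t * lam * sp.exp(-lam * t), (t, 0, Ts)) / (1 - sp.exp(-lam * Ts))
    closed_form = 1 / lam - Ts / (sp.exp(lam * Ts) - 1)
    print(sp.simplify(ETL - closed_form))   # -> 0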

29 Large Scale Computation: Coping with node failures: Exercise
For a single task: E[T] = (e^{pT} - 1)(1/p + 10T)
For n tasks: E[T] = n (e^{pT} - 1)(1/p + 10T)
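Plugging illustrative numbers into the final formula shows how the result compares to the failure-free baseline nT; a minimal sketch (all parameter values here are assumptions for the sake of the example):

    import math

    p, T, n = 1e-6, 60.0, 10_000   # assumed failure rate/s, task length (s), number of tasks
    expected = n * (math.exp(p * T) - 1) * (1 / p + 10 * T)
    print(expected)                # -> approx. 600,400 s total expected execution time
    print(n * T)                   # failure-free baseline: 600,000 s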

30 Large Scale Computation: Coordination
Using this information we can improve scheduling.
We can also optimize checking for node failures, e.g. check-pointing strategies.

31 Large Scale Computation: data copying
Copying data over the network takes time.
Bring data closer to the computation, i.e. process data locally at each node.
Replicate data to increase reliability.

32 Solution: Map-Reduce
Storage infrastructure: a distributed file system (Google: GFS; Hadoop: HDFS).
Programming model: Map-Reduce.

33 Map-Reduce: Storage infrastructure
Problem: if a node fails, how do we store a file persistently?
Answer: a distributed file system, which provides a global file namespace.
Typical usage pattern: huge files (several hundred GB to 1 TB); data is rarely updated in place; reads and appends are common.

34 Map-Reduce: Distributed file system
Chunk servers: a file is split into contiguous chunks, typically 16-64 MB each.
Each chunk is replicated (usually 2x or 3x); try to keep replicas in different racks.

35 Map-Reduce: Distributed file system
Master node: stores metadata about where files are stored; might be replicated.
Client library for file access: talks to the master node to find chunk servers, then connects directly to chunk servers to access data.

36 Map-Reduce: Distributed file system
Reliable distributed file system: data is kept in chunks spread across machines, and each chunk is replicated on different machines, giving seamless recovery from disk or machine failure.
Bring computation directly to the data: chunk servers also serve as compute servers.
Figure: Chunks C0, C1, ... and D0, D1, ... replicated across chunk servers 1..N (figure from slides by Jure Leskovec).

37 Map-Reduce: Programming model
Running example: we want to count the number of occurrences of each word in a collection of documents. In this example, the input file is a repository of documents, and each document is an element.
This has meanwhile become the standard Map-Reduce example.

38 Map-Reduce: Programming model
words input_file | sort | uniq -c
Three-step process:
1 Split the file into words, each word on a separate line
2 Group and sort all words
3 Count the occurrences

39 Map-Reduce: Programming model
This captures the essence of Map-Reduce: split, group, count.
It is naturally parallelizable, e.g. the split and count steps.

40 Map-Reduce: Programming model
Sequentially read a lot of data.
Map: extract something that you care about: (key, value).
Group by key: sort and shuffle.
Reduce: aggregate, summarize, filter or transform.
Write the result.
The outline is always the same: Map and Reduce change to fit the problem.

41 Map-Reduce: Programming model
Map tasks turn input chunks into key-value pairs (k, v). Grouping by keys is done so that all key-value pairs with the same key wind up at the same Reduce task, as keys with all their values (k, [v, w, ...]). The Reduce tasks work on one key at a time and combine all the values associated with that key in some way; the manner of combination of values is determined by the code written by the user for the Reduce function. These results form the combined output.
Figure: Schematic of a map-reduce computation (Figure 2.2 from the book Mining Massive Datasets).

42 Map-Reduce: Programming model
Input: a set of (key, value) pairs (e.g. key is the filename, value is a single line in the file).
Map(k, v) → ⟨(k', v')⟩: takes a (k, v) pair and outputs a set of (k', v') pairs. There is one Map call for each (k, v) pair.
Reduce(k', ⟨v'⟩) → (k', v''): all values v' with the same key k' are reduced together and processed in v' order. There is one Reduce call for each unique k'.

43 Map-Reduce: Programming model
Big document: "Star Wars is an American epic space opera franchise centered on a film series created by George Lucas. The film series has spawned an extensive media franchise called the Expanded Universe including books, television series, computer and video games, and comic books. These supplements to the two film trilogies..."
Map: (Star, 1), (Wars, 1), (is, 1), (an, 1), (American, 1), (epic, 1), (space, 1), (opera, 1), (franchise, 1), (centered, 1), (on, 1), (a, 1), (film, 1), (series, 1), (created, 1), (by, 1), ...
Group by key: (Star, 1) (Star, 1); (Wars, 1) (Wars, 1); (a, 1) (a, 1) (a, 1) (a, 1) (a, 1) (a, 1); (film, 1) (film, 1) (film, 1); (franchise, 1); (series, 1) (series, 1)
Reduce: (Star, 2), (Wars, 2), (a, 6), (film, 3), (franchise, 1), (series, 2)

44 Map-Reduce: Programming model
map(key, value):
    // key: document name
    // value: a single line from a document
    foreach word w in value:
        emit(w, 1)

45 Map-Reduce: Programming model
reduce(key, values):
    // key: a word
    // values: an iterator over counts
    result = 0
    foreach count c in values:
        result += c
    emit(key, result)
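Putting the two slides together, a minimal single-machine Python sketch of the whole Map, group-by-key, Reduce flow (an illustration only, not a real Hadoop job; the documents are made up):

    from itertools import groupby
    from operator import itemgetter

    def map_fn(key, value):
        # key: document name, value: a single line; emit (word, 1) pairs
        for w in value.split():
            yield (w, 1)

    def reduce_fn(key, values):
        # key: a word, values: an iterator over counts
        yield (key, sum(values))

    docs = [("d1", "star wars is a film"), ("d2", "a film series")]
    pairs = sorted((kv for name, line in docs for kv in map_fn(name, line)),
                   key=itemgetter(0))                   # group by key: sort...
    for k, group in groupby(pairs, key=itemgetter(0)):  # ...then group
        for word, count in reduce_fn(k, (v for _, v in group)):
            print(word, count)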

46 Environment: Map-Reduce computation
The Map-Reduce environment takes care of:
1 Partitioning the input data
2 Scheduling the program's execution across a set of machines
3 Performing the group-by-key step
4 Handling machine failures
5 Managing required inter-machine communication

47 Environment: Map-Reduce computation
Map: reads the input (a big document) and produces a set of key-value pairs.
Group by key: collects all pairs with the same key (hash merge, shuffle, sort, partition).
Reduce: collects all values belonging to a key and outputs the result.
Figure: From the course by Jure Leskovec (Stanford University).

48 Environment: Map-Reduce computation
All phases are distributed, with many tasks doing the work.
Figure: From the course by Jure Leskovec (Stanford University).

49 Environment: Data flow
Input and final output are stored in the distributed file system. The scheduler tries to schedule Map tasks close to the physical storage location of the input data.
Intermediate results are stored on the local file systems of Map and Reduce workers.
The output is often the input to another Map-Reduce computation.

50 Environment: Coordination: Master
The master node takes care of coordination: it tracks task status (idle, in-progress, completed), and idle tasks get scheduled as workers become available.
When a Map task completes, it notifies the master about the size and location of its intermediate files, and the master pushes this information to the reducers.
The master pings workers periodically to detect failures.

51 Environment: Map-Reduce execution details
A Worker process reports to the Master when it finishes a task, and a new task is scheduled by the Master for that Worker process. Each Map task is assigned one or more chunks of the input file(s) and executes on them the code written by the user. The Map task creates a file for each Reduce task on the local disk of the Worker that executes the Map task.
Figure: Overview of the execution of a map-reduce program: the User Program forks a Master and Worker processes; the Master assigns Map and Reduce tasks to Workers, which read the input data, write intermediate files, and produce the output files (Figure 2.3 from the book Mining Massive Datasets).

52 Map-Reduce Skew: Maximizing parallelism
If we want maximum parallelism, then: use one Reduce task for each reducer (i.e. a single key and its associated value list), and execute each Reduce task at a different compute node.
This plan is typically not the best one.

53 Map-Reduce Skew: Maximizing parallelism
There is overhead associated with each task we create, so we might want to keep the number of Reduce tasks lower than the number of different keys: we do not want to create a task for a key with a short value list.
There are often far more keys than there are compute nodes, e.g. when counting words in Wikipedia or on the Web.

54 Map-Reduce Skew: Input data skew: Exercise
Exercise: Suppose we execute the word-count map-reduce program on a large repository such as a copy of the Web. We shall use 100 Map tasks and some number of Reduce tasks.
1 Do you expect there to be significant skew in the times taken by the various reducers to process their value lists? Why or why not?
2 If we combine the reducers into a small number of Reduce tasks, say 10 tasks, at random, do you expect the skew to be significant? What if we instead combine the reducers into 10,000 Reduce tasks?
(This exercise is based on the example from Mining Massive Datasets.)

55 Map-Reduce Skew: Maximizing parallelism
There is often significant variation in the lengths of the value lists for different keys, so different reducers take different amounts of time to finish.
If we make each reducer a separate Reduce task, then the task execution times will exhibit significant variance.

56 Map-Reduce Skew: Input data skew
Input data skew describes an uneven distribution of the number of values per key.
Examples include power-law graphs (e.g. the Web or Wikipedia) and other data with a Zipfian distribution, e.g. the number of word occurrences.

57 Map-Reduce Skew: Power-law (Zipf) random variable
PMF:
p(k) = k^{-α} / ζ(α), for k ∈ N, k ≥ 1, α > 1
ζ(α) is the Riemann zeta function: ζ(α) = Σ_{k=1}^∞ k^{-α}
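This PMF ships with scipy as scipy.stats.zipf; a small sketch comparing it against the formula above (assumes scipy and numpy are installed):

    import numpy as np
    from scipy.stats import zipf
    from scipy.special import zeta

    alpha = 2.0
    k = np.arange(1, 6)
    print(zipf.pmf(k, alpha))               # scipy's k^{-alpha} / zeta(alpha)
    print(k ** (-alpha) / zeta(alpha, 1))   # the same values, computed directly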

58 Map-Reduce Skew: Power-law (Zipf) random variable
Figure: Probability mass function of a Zipf random variable for differing values of α.

59 Map-Reduce Skew: Power-law (Zipf) input data
Figure: Distribution of key sizes (sum = 189681, µ = 1.897).

60 Map-Reduce Skew: Tackling input data skew
We need to distribute skewed (power-law) input data over a number of reducers/Reduce tasks/compute nodes.
The distribution of the key lengths inside the reducers/Reduce tasks/compute nodes should be approximately normal, and its variance should be smaller than the original variance.
If the variance is small, efficient load balancing is possible.

61 Map-Reduce Skew: Tackling input data skew
Each Reduce task receives a number of keys. The total number of values to process is the sum of the number of values over all its keys, and the average number of values that a Reduce task processes is the average over all its keys.
Equivalently, each compute node receives a number of Reduce tasks; the sum and average for a compute node are the sum and average over all Reduce tasks at that node.

62 Map-Reduce Skew: Tackling input data skew
How should we distribute keys to Reduce tasks?

63 Map-Reduce Skew: Tackling input data skew
How should we distribute keys to Reduce tasks? Uniformly at random.
Other possibilities?

64 Map-Reduce Skew: Tackling input data skew
How should we distribute keys to Reduce tasks? Uniformly at random.
Other possibilities? Calculate the capacity of a single Reduce task and add keys until the capacity is reached, etc. (see the sketch below).
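A minimal sketch contrasting the two options above on hypothetical Zipf-like value-list lengths: uniformly random assignment versus a greedy capacity-style assignment that always places the next-largest key on the least-loaded Reduce task:

    import random
    from heapq import heappush, heappop

    def random_assign(lengths, n_tasks):
        loads = [0] * n_tasks
        for l in lengths:
            loads[random.randrange(n_tasks)] += l   # key goes to a random task
        return sorted(loads)

    def greedy_assign(lengths, n_tasks):
        heap = [(0, i) for i in range(n_tasks)]     # (current load, task id)
        for l in sorted(lengths, reverse=True):     # largest keys first
            load, i = heappop(heap)                 # least-loaded task...
            heappush(heap, (load + l, i))           # ...receives the key
        return sorted(load for load, _ in heap)

    lengths = [10000 // r for r in range(1, 1001)]  # hypothetical skewed lengths
    print(random_assign(lengths, 10))
    print(greedy_assign(lengths, 10))               # visibly better balanced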

65 Map-Reduce Skew: Tackling input data skew
We are averaging over a skewed distribution.
Are there laws that describe how the averages of sufficiently large samples drawn from a probability distribution behave? In other words, how are the averages of samples of a r.v. distributed?

66 Map-Reduce Skew: Tackling input data skew
Are there laws that describe how the averages of sufficiently large samples drawn from a probability distribution behave? In other words, how are the averages of samples of a r.v. distributed?
The Central Limit Theorem.

67 Map-Reduce Skew: Central Limit Theorem
The central limit theorem describes the distribution of the arithmetic mean of sufficiently large samples of independent and identically distributed random variables.
The means are normally distributed. The mean of the new distribution equals the mean of the original distribution, and the variance of the new distribution equals σ²/n, where σ² is the variance of the original distribution.
Thus, we keep the mean and reduce the variance.

68 Map-Reduce Skew: Central Limit Theorem
Theorem: Suppose X_1, ..., X_n are independent and identically distributed r.v. with expectation µ and variance σ². Let Y_n be the r.v. defined as:
Y_n = (1/n) Σ_{i=1}^n X_i
As n grows, the CDF F_n(y) of Y_n tends to the CDF of a normal r.v. with mean µ and variance σ²/n:
F_n(y) ≈ ∫_{-∞}^{y} (1 / √(2π σ²/n)) e^{-(x - µ)² / (2σ²/n)} dx

69 Map-Reduce Skew: Central Limit Theorem
Practically, it is possible to replace F_n(y) with a normal distribution for n > 30, so we should always average over at least 30 values.
Example: approximating a uniform r.v. with a normal r.v. by sampling and averaging (see the sketch below).
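A minimal numpy sketch of exactly this experiment, averaging samples of size n = 30 drawn from a uniform r.v. (the number of repetitions is arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    samples = rng.uniform(0, 1, size=(100_000, 30))  # 100,000 samples of size n = 30
    means = samples.mean(axis=1)
    # CLT: the means are approx. normal with mean 0.5 and variance (1/12)/30
    print(means.mean(), means.var())
    print(0.5, (1 / 12) / 30)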

70 Map-Reduce Skew: Central Limit Theorem
Figure: Distribution of sample averages (µ = 0.5).

71 Map-Reduce Skew: Central Limit Theorem
Figure: Distribution of sample averages (µ = 0.499).

72 Map-Reduce Skew: Central Limit Theorem
IPython Notebook example: http://kti.tugraz.at/staff/denis/courses/kddm1/clt.ipynb
Command line: ipython notebook --pylab=inline clt.ipynb

73 Map-Reduce Skew: Input data skew
We can reduce the impact of the skew by using fewer Reduce tasks than there are reducers.
If keys are sent randomly to Reduce tasks, we average over the value-list lengths, and thus over the total time for each Reduce task (Central Limit Theorem).
We should make sure that the sample size is large enough (n > 30).

74 Map-Reduce Skew: Input data skew
We can further reduce the skew by using more Reduce tasks than there are compute nodes.
Long Reduce tasks might occupy a compute node fully, while several shorter Reduce tasks are executed sequentially at a single compute node. Thus, we average over the total time for each compute node (Central Limit Theorem).
We should make sure that the sample size is large enough (n > 30).

75 Map-Reduce Skew: Input data skew
Figure: Distribution of key sizes (sum = 196524, µ = 1.965).

76 Map-Reduce Skew: Input data skew
Figure: Distribution of key sizes per Reduce task (sum = 196524).

77 Map-Reduce Skew: Input data skew
Figure: Distribution of Reduce-task key averages (µ = 1.958).

78 Map-Reduce Skew: Input data skew
Figure: Distribution of key sizes per compute node (sum = 196524) and of node key averages (µ = 1.976).

79 Map-Reduce Skew: Input data skew
IPython Notebook example: http://kti.tugraz.at/staff/denis/courses/kddm1/mrskew.ipynb
Command line: ipython notebook --pylab=inline mrskew.ipynb

80 Map-Reduce Skew: Combiners
Sometimes a Reduce function is associative and commutative.
Commutative: x ⊕ y = y ⊕ x. Associative: (x ⊕ y) ⊕ z = x ⊕ (y ⊕ z).
Then the values can be combined in any order, with the same result. The addition in the reducer of the word-count example is such an operation.

81 Map-Reduce Skew: Combiners
When the Reduce function is associative and commutative, we can push some of the reducers' work to the Map tasks.
E.g. instead of emitting (w, 1), (w, 1), ..., we can apply the Reduce function within the Map task.
In that way the output of the Map task is combined before grouping and sorting (see the sketch below).
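A minimal sketch of such a map-side combiner for word count, using collections.Counter for the local aggregation (the function name is illustrative):

    from collections import Counter

    def map_with_combiner(key, value):
        # Apply the (associative, commutative) addition locally in the Map task,
        # emitting (w, local_count) instead of many (w, 1) pairs
        for w, c in Counter(value.split()).items():
            yield (w, c)

    print(list(map_with_combiner("d1", "a film a series a film")))
    # -> [('a', 3), ('film', 2), ('series', 1)]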

82 Map-Reduce Skew: Combiners
Map: (Star, 1), (Wars, 1), (is, 1), ... (American, 1), (epic, 1), (space, 1), ... (franchise, 1), (centered, 1), (on, 1), ... (film, 1), (series, 1), (created, 1), ...
Combiner: (Star, 2), (Wars, 2), (a, 6), (a, 3), (a, 4), ... (film, 3), (franchise, 1), (series, 2)
Group by key: (Star, 2); (Wars, 2); (a, 6) (a, 3) (a, 4); (film, 3); (franchise, 1); (series, 2)
Reduce: (Star, 2), (Wars, 2), (a, 13), (film, 3), (franchise, 1), (series, 2)

83 Map-Reduce Skew: Input data skew: Exercise
Exercise: Suppose we execute the word-count map-reduce program on a large repository such as a copy of the Web. We shall use 100 Map tasks and some number of Reduce tasks.
1 Do you expect there to be significant skew in the times taken by the various reducers to process their value lists? Why or why not?
2 If we combine the reducers into a small number of Reduce tasks, say 10 tasks, at random, do you expect the skew to be significant? What if we instead combine the reducers into 10,000 Reduce tasks?
3 Suppose we do use a combiner at the 100 Map tasks. Do you expect skew to be significant? Why or why not?

84 Map-Reduce Skew: Input data skew
Figure: Distribution of key sizes (sum = 195279, µ = 1.953).

85 Map-Reduce Skew: Input data skew
Figure: Distribution of Reduce-task key averages (µ = 1.793).

86 Map-Reduce Skew: Input data skew
IPython Notebook example: http://kti.tugraz.at/staff/denis/courses/kddm1/combiner.ipynb
Command line: ipython notebook --pylab=inline combiner.ipynb

87 Map-Reduce Skew: Map-Reduce: Exercise
Exercise: Suppose we have an n x n matrix M, whose element in row i and column j is denoted m_ij. Suppose we also have a vector v of length n, whose jth element is v_j. Then the matrix-vector product is the vector x of length n, whose ith element is given by
x_i = Σ_{j=1}^n m_ij v_j
Outline a Map-Reduce program that calculates the vector x.
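One possible outline, as a sketch (assuming, as in the basic version from Mining Massive Datasets, that the whole vector v fits in memory at each Map task; the function names are illustrative):

    # Each input record is one matrix element (i, j, m_ij); v is held in memory.
    def map_fn(_, entry, v):
        i, j, m_ij = entry
        yield (i, m_ij * v[j])    # key: output row i, value: partial product

    def reduce_fn(i, values):
        yield (i, sum(values))    # x_i = sum over j of m_ij * v_j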
