Modeling Randomized Data Streams in Caching, Data Processing, and Crawling Applications

Similar documents
Modeling Randomized Data Streams in Caching, Data Processing, and Crawling Applications

Probabilistic Near-Duplicate. Detection Using Simhash

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Large-Scale Behavioral Targeting

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

Modeling Residual-Geometric Flow Sampling

Slides based on those in:

Robust Lifetime Measurement in Large- Scale P2P Systems with Non-Stationary Arrivals

Understanding Sharded Caching Systems

Notes on MapReduce Algorithms

Reading: Chapter 9.3. Carnegie Mellon

CS6220: DATA MINING TECHNIQUES

Estimation of DNS Source and Cache Dynamics under Interval-Censored Age Sampling

CS249: ADVANCED DATA MINING

Algorithms and Data Structures

Project 2: Hadoop PageRank Cloud Computing Spring 2017

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University

Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51

CSE 4502/5717 Big Data Analytics Spring 2018; Homework 1 Solutions

Handling a Concept Hierarchy

Skip Lists. What is a Skip List. Skip Lists 3/19/14

Coding for loss tolerant systems

Math 304 Handout: Linear algebra, graphs, and networks.

CSE 150. Assignment 6 Summer Maximum likelihood estimation. Out: Thu Jul 14 Due: Tue Jul 19

Min Congestion Control for High- Speed Heterogeneous Networks. JetMax: Scalable Max-Min

Progressive & Algorithms & Systems

Scalable Distributed Subgraph Enumeration

CS425: Algorithms for Web Scale Data

Cache-Oblivious Algorithms

Google PageRank. Francesco Ricci Faculty of Computer Science Free University of Bozen-Bolzano

On Delay-Independent Diagonal Stability of Max-Min Min Congestion Control

Question Paper Code :

Cache-Oblivious Algorithms

CS6220: DATA MINING TECHNIQUES

ECEN 689 Special Topics in Data Science for Communications Networks

Data and Algorithms of the Web

Jeffrey D. Ullman Stanford University

AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis

Driving Cache Replacement with ML-based LeCaR

Recap & Interval Scheduling

Impact of traffic mix on caching performance

Asymptotically Exact TTL-Approximations of the Cache Replacement Algorithms LRU(m) and h-lru

RANDOMIZED ALGORITHMS

Introduction to Data Mining

A General-Purpose Counting Filter: Making Every Bit Count. Prashant Pandey, Michael A. Bender, Rob Johnson, Rob Patro Stony Brook University, NY

CSE P 501 Compilers. Value Numbering & Op;miza;ons Hal Perkins Winter UW CSE P 501 Winter 2016 S-1

Enabling Web GIS. Dal Hunter Jeff Shaner

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Cache Oblivious Stencil Computations

CS246 Final Exam. March 16, :30AM - 11:30AM

Computing PageRank using Power Extrapolation

Discrete Wiskunde II. Lecture 5: Shortest Paths & Spanning Trees

ECE 571 Advanced Microprocessor-Based Design Lecture 9

CSE525: Randomized Algorithms and Probabilistic Analysis April 2, Lecture 1

MVE165/MMG630, Applied Optimization Lecture 6 Integer linear programming: models and applications; complexity. Ann-Brith Strömberg

TTL Approximations of the Cache Replacement Algorithms LRU(m) and h-lru

Redundant Array of Independent Disks

CS246: Mining Massive Data Sets Winter Only one late period is allowed for this homework (11:59pm 2/14). General Instructions

Is Information-Centric Multi-Tree Routing Feasible?

IR: Information Retrieval

Problem-Solving via Search Lecture 3

Chapter 4. Greedy Algorithms. Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved.

Data Mining Recitation Notes Week 3

A Tightly-coupled XML Encoderdecoder For 3D Data Transaction

ArcGIS Deployment Pattern. Azlina Mahad

NEC PerforCache. Influence on M-Series Disk Array Behavior and Performance. Version 1.0

Computational Frameworks. MapReduce

CSE 202 Homework 4 Matthias Springer, A

Fast Searching for Andrews Curtis Trivializations

CS 580: Algorithm Design and Analysis

Google Page Rank Project Linear Algebra Summer 2012

6.830 Lecture 11. Recap 10/15/2018

Complexity (Pre Lecture)

On Content Indexing for Off-Path Caching in Information-Centric Networks

SMALL-WORLD NAVIGABILITY. Alexandru Seminar in Distributed Computing

Prediction Experience and New Model

Randomized Algorithms III Min Cut

Lab Exploration #4: Solar Radiation & Temperature Part II: A More Complex Computer Model

DATA MINING - 1DL105, 1DL111

Prioritized Garbage Collection Using the Garbage Collector to Support Caching

Online Social Networks and Media. Link Analysis and Web Search

Using R for Iterative and Incremental Processing

Advanced Implementations of Tables: Balanced Search Trees and Hashing

4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron

Class President: A Network Approach to Popularity. Due July 18, 2014

AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis

Summarizing Measured Data

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing

Introduction to Data Mining

MIT Loop Optimizations. Martin Rinard

Random Sampling on Big Data: Techniques and Applications Ke Yi

WHAT IS A SKIP LIST? S 2 S 1. A skip list for a set S of distinct (key, element) items is a series of lists

Comparing linear modularization criteria using the relational notation

Summarizing Measured Data

THE idea of network coding over error-free networks,

MATH6142 Complex Networks Exam 2016

Section 1.2 More on finite limits

Undirected Graphs. V = { 1, 2, 3, 4, 5, 6, 7, 8 } E = { 1-2, 1-3, 2-3, 2-4, 2-5, 3-5, 3-7, 3-8, 4-5, 5-6 } n = 8 m = 11

Transcription:

Modeling Randomized Data Streams in Caching, Data Processing, and Crawling Applications Sarker Tanzir Ahmed and Dmitri Loguinov Internet Research Lab Department of Computer Science and Engineering Texas A&M University April 29, 2015 1

Agenda Introduction Analysis of 1D Streams LRU Performance MapReduce Disk I/O Analysis of 2D Streams Properties of the Seen Set, Discovered Nodes, and the Frontier Conclusion 2

Introduction Key-value input pairs are common to MapReduce and many other types of applications Input is typically a finite length stream where the keys come off a finite set Experience of the processing application (e.g., RAM/disk usage, processing speed) depends largely on the properties of the stream (i.e. key frequency) Example: Least Recently Used (LRU) cache s hit rate is governed by popularities of items 3

Introduction (2) MapReduce applications combined output (the result of merging duplicate keys in a window of pairs) Depends on the frequency properties of the keys Usually, the higher the frequencies of each item, the smaller the size of the combined output Existing literature is missing accurate model Common to assume linear ratio between input and output 4

Agenda Introduction Analysis of 1D Streams LRU Performance MapReduce Disk I/O Analysis of 2D Streams Properties of the Seen Set, Discovered Nodes, and the Frontier Conclusion 5

Analysis of 1D Streams Define one-dimensional (1D) streams as discrete-time processes fy t g t 1, where each item Y t is observed at t Y t is unique (i.e., previously unseen) with probability p(t) (also called uniqueness probability), and duplicate otherwise Input is a stream of length T Keys belong to a finite set V of size n Each key v is repeated I(v) times (random variable I also denotes frequency distribution of v s) The seen set at t is denoted by S t, and the unseen set by U t We also assume uniform shuffle of the items across the stream Independent Reference Model (IRM) 6

Analysis of 1D Streams (2) Theorem 1: The probability of seeing a unique (previously unseen) key at t (using ² t =t=t) is: Fig: Verification of p(t) under E[I]=10, n=10k. 7

Analysis of 1D Streams (3) Theorem 2: The size of the seen set at t after using jaj=á(a) is: Fig: Verification of the size of S t (E[I]=10, n=10k). 8

1D Streams - Applications We consider two applications: Miss rate of LRU cache Disk I/O of MapReduce For verification, we use the following two workloads in addition to simulated input: IRLbot host graph (640M nodes, 6.8B edges, 55 GB) WebBase web graph (635M nodes, 4.2B edges, 35 GB) 9

LRU Cache Miss Rate Theorem 3: The miss rate of a LRU cache of size C is: Here, the value is obtained asf -1 (C), where f(t)=e[á(s t )] Fig. Verification of LRU miss rate in real graphs. 10

1D Streams - MapReduce Disk I/O Input is a stream of length T Entries are key-value pairs, each K+D bytes At time step t, one pair is processed by MapReduce Disk I/O consists of: Input with T pairs (some duplicate) Output with n unique pairs Sorted runs of size L Total disk overhead is W = (K +D)(T + n + 2L) Our goal is to derive L RAM can hold m pairs in a merge-sort MapReduce Then, k = dt/me is the number of sorted runs, where each contains E[jS m j] pairs on average 11

MapReduce Disk I/O (2) Theorem 4: Disk spill L of a merge-sort MapReduce is: And, the total disk I/O is thus: Fig: Verification of Disk I/O of a merge-sort MapReduce. 12

Agenda Introduction Analysis of 1D Streams LRU Performance MapReduce Disk I/O Analysis of 2D Streams Properties of the Seen Set, Discovered Nodes, and the Frontier Conclusion 13

Analysis of 2D Streams Two-dimensional (2D) streams are mainly applicable in analyzing graph traversal algorithms Consider a simple directed random graph G(V,E) V and E are the set of nodes and edges, respectively. Let E = T and the in/out-deg sequences be fi(v)g vϵv and fo(v)g vϵv Define the stream of edges of this graph seen by a crawler as a 2D discrete-time process f(x t, Y t )g T t=1 Here X t is the crawled node and Y t the destination node Define the crawled set as, the seen set as, and the frontier as The goal is to analyze the stream of Y t s, and the sets S t and F t as they change over crawl time t 14

Analysis of 2D Streams (2) Fig. Verification of p(t) under BFS crawl on graph (E[I]=10, n=10k). Fig. Verification of the seen set size in BFS (E[I]=10, n=10k). 15

Seen Set Properties Theorem 5: The average in/out-degree of the nodes in the seen set: Fig: Verification of the average in/out-degree of the seen set 16

Destination Node Properties Theorem 6: The in-degree distribution of the Y t is : Helps obtain an unbiased estimator of P(I=k) after observing m edges: Theorem 7: The average in/out-degree of Y t is independent of time and equals: while that of Y t, conditioned on its being unseen is: 17

Destination Node Properties - Verification Fig: Verification of the average in-degree of all Y t s and the unseen Y t s, respectively 18

Properties of the Frontier A crawling method s efficiency is a function of the frontier size The more the size, the more the load on duplicate-elimination, prioritization algorithms Theorem 8: The following iterative relation computes the size of the frontier (let Á(A)= A ): We consider two crawling methods to examine their frontier sizes: Breads First Search (BFS) Frontier RaNdomization (FRN), where any node from the frontier is picked randomly for crawling 19

Properties of the Frontier (2) Theorem 9: For BFS, the out-degree of the crawled node is given by: while that for FRN is simply: E[Á(F t )]= E[O(F t )]/E[O]. Fig: Verification of the frontier size with E[I]=10 and n = 10K. 20

Agenda Introduction Analysis of 1D Streams LRU Performance MapReduce Disk I/O Analysis of 2D Streams Properties of the Seen Set, Discovered Nodes, and the Frontier Conclusion 21

Conclusion Presented accurate analytic models of performance based on workload characterization Proposed a common modeling framework for a number of apparently unrelated fields (i.e., caching, MapReduce, crawl modeling) 22

Thank you! Questions? 23