Using R for Iterative and Incremental Processing

Similar documents
High Performance Parallel Tucker Decomposition of Sparse Tensors

Lab 8: Measuring Graph Centrality - PageRank. Monday, November 5 CompSci 531, Fall 2018

Large-Scale Behavioral Targeting

Scalable Asynchronous Gradient Descent Optimization for Out-of-Core Models

A New Space for Comparing Graphs

Community Detection. fundamental limits & efficient algorithms. Laurent Massoulié, Inria

Link Analysis Ranking

U.C. Berkeley Better-than-Worst-Case Analysis Handout 3 Luca Trevisan May 24, 2018

STA141C: Big Data & High Performance Statistical Computing

Using SVD to Recommend Movies

Big & Quic: Sparse Inverse Covariance Estimation for a Million Variables

Restricted Boltzmann Machines for Collaborative Filtering

Lecture 13: Spectral Graph Theory

Multi-Approximate-Keyword Routing Query

Behavioral Simulations in MapReduce

Dot-Product Join: Scalable In-Database Linear Algebra for Big Model Analytics

Facebook Friends! and Matrix Functions

CS246 Final Exam, Winter 2011

Lecture: Local Spectral Methods (1 of 4)

Link Mining PageRank. From Stanford C246

Link Analysis. Leonid E. Zhukov

DATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD

Parallel Transposition of Sparse Data Structures

Complex Social System, Elections. Introduction to Network Analysis 1

Large-scale Collaborative Ranking in Near-Linear Time

Lecture: Local Spectral Methods (2 of 4) 19 Computing spectral ranking with the push procedure

Introduction to Search Engine Technology Introduction to Link Structure Analysis. Ronny Lempel Yahoo Labs, Haifa

Spectral Graph Theory and You: Matrix Tree Theorem and Centrality Metrics

Lecture 10. Lecturer: Aleksander Mądry Scribes: Mani Bastani Parizi and Christos Kalaitzis

Quilting Stochastic Kronecker Graphs to Generate Multiplicative Attribute Graphs

A Note on Google s PageRank

STA141C: Big Data & High Performance Statistical Computing

Knowledge Discovery and Data Mining 1 (VO) ( )

CS6220: DATA MINING TECHNIQUES

Lecture 13 Spectral Graph Algorithms

MapReduce in Spark. Krzysztof Dembczyński. Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland

CS224W: Methods of Parallelized Kronecker Graph Generation

Spectral Graph Theory and its Applications. Daniel A. Spielman Dept. of Computer Science Program in Applied Mathematics Yale Unviersity

Laplacian Matrices of Graphs: Spectral and Electrical Theory

Link Analysis. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

Lecture 9: September 28

Parallel Matrix Factorization for Recommender Systems

How works. or How linear algebra powers the search engine. M. Ram Murty, FRSC Queen s Research Chair Queen s University

J.I. Aliaga 1 M. Bollhöfer 2 A.F. Martín 1 E.S. Quintana-Ortí 1. March, 2009

Slides based on those in:

Eigenvalue Problems Computation and Applications

A robust multilevel approximate inverse preconditioner for symmetric positive definite matrices

1 Matrix notation and preliminaries from spectral graph theory

Spectral Clustering on Handwritten Digits Database

Google PageRank. Francesco Ricci Faculty of Computer Science Free University of Bozen-Bolzano

CS6220: DATA MINING TECHNIQUES

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

Approximating a single component of the solution to a linear system

ArcGIS Enterprise: What s New. Philip Heede Shannon Kalisky Melanie Summers Shreyas Shinde

Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent

Matrix Factorization and Factorization Machines for Recommender Systems

CS249: ADVANCED DATA MINING

Spectral Graph Theory

Google Page Rank Project Linear Algebra Summer 2012

Ef#icient Processing of Large Graphs via Input Reduction

CSC 1700 Analysis of Algorithms: Warshall s and Floyd s algorithms

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices)

STA141C: Big Data & High Performance Statistical Computing

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Implementation of a preconditioned eigensolver using Hypre

Sparse solver 64 bit and out-of-core addition

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH

MapReduce in Spark. Krzysztof Dembczyński. Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland

Clustering based tensor decomposition

ELEC6910Q Analytics and Systems for Social Media and Big Data Applications Lecture 3 Centrality, Similarity, and Strength Ties

Conditioning of the Entries in the Stationary Vector of a Google-Type Matrix. Steve Kirkland University of Regina

Review: From problem to parallel algorithm

Graphs, Vectors, and Matrices Daniel A. Spielman Yale University. AMS Josiah Willard Gibbs Lecture January 6, 2016

Project 2: Hadoop PageRank Cloud Computing Spring 2017

An Efficient FETI Implementation on Distributed Shared Memory Machines with Independent Numbers of Subdomains and Processors

Data Mining Recitation Notes Week 3

Efficient algorithms for symmetric tensor contractions

The treatment of uncertainty in uniform workload distribution problems

Markov Models. CS 188: Artificial Intelligence Fall Example. Mini-Forward Algorithm. Stationary Distributions.

QALGO workshop, Riga. 1 / 26. Quantum algorithms for linear algebra.

A physical model for efficient rankings in networks

Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci

Data Structures. Outline. Introduction. Andres Mendez-Vazquez. December 3, Data Manipulation Examples

Graph Models The PageRank Algorithm

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS

Algorithmic Primitives for Network Analysis: Through the Lens of the Laplacian Paradigm

Machine Learning for Data Science (CS4786) Lecture 11

Diagonalization. MATH 322, Linear Algebra I. J. Robert Buchanan. Spring Department of Mathematics

CS 188: Artificial Intelligence Spring Announcements

ECEN 689 Special Topics in Data Science for Communications Networks

CS425: Algorithms for Web Scale Data

Stat 315c: Introduction

ORIE 4741: Learning with Big Messy Data. Spectral Graph Theory

Big Data Analytics. Lucas Rego Drumond

Math 304 Handout: Linear algebra, graphs, and networks.

Distributed Randomized Algorithms for the PageRank Computation Hideaki Ishii, Member, IEEE, and Roberto Tempo, Fellow, IEEE

THE STATE OF CONTEMPORARY COMPUTING SUBSTRATES FOR OPTIMIZATION METHODS. Benjamin Recht UC Berkeley

CS246: Mining Massive Data Sets Winter Only one late period is allowed for this homework (11:59pm 2/14). General Instructions

CA-SVM: Communication-Avoiding Support Vector Machines on Distributed System

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

Transcription:

Using R for Iterative and Incremental Processing Shivaram Venkataraman, Indrajit Roy, Alvin AuYoung, Robert Schreiber UC Berkeley and HP Labs UC BERKELEY

Big Data, Complex Algorithms PageRank (Dominant eigenvector) Recommendations (Matrix factorization) Anomaly detection (Top-K eigenvalues) User Importance (Vertex Centrality) 6/29/2012 2

Big Data, Complex Algorithms PageRank (Dominant eigenvector) Machine learning + Graph algorithms Recommendations (Matrix factorization) Iterative Linear Algebra Operations Anomaly detection (Top-K eigenvalues) User Importance (Vertex Centrality) 6/29/2012 3

PageRank Using Matrices N s P 1 P 1 s P 1 P 1 N p M p Z Dominant eigenvector M = modified web graph matrix p = PageRank vector Simplified algorithm: repeat { p = M*p + Z} 6/29/2012 4

Breadth-first Search Using Matrices A B C D E X 1 0 0 0 0 A B C D E A 1 1 1 0 0 B 0 1 0 1 0 C 0 1 1 0 0 D 0 0 0 1 1 G E 0 0 0 0 1 G = adjacency matrix X = BFS vector Simplified algorithm: repeat { X = G*X } 6/29/2012 5

Breadth-first Search Using Matrices A B D A B D C E C E X 1 0 0 0 0 A B C D E * * * 0 0 A B C D E A 1 1 1 0 0 B 0 1 0 1 0 C 0 1 1 0 0 D 0 0 0 1 1 G E 0 0 0 0 1 G = adjacency matrix X = BFS vector Simplified algorithm: repeat { X = G*X } 6/29/2012 6

Breadth-first Search Using Matrices A B D A B D A B D C E C E C E X 1 0 0 0 0 * * * 0 0 * * * * 0 A B C D E A B C D E A B C D E A 1 1 1 0 0 B 0 1 0 1 0 C 0 1 1 0 0 D 0 0 0 1 1 G E 0 0 0 0 1 G = adjacency matrix X = BFS vector Simplified algorithm: repeat { X = G*X } 6/29/2012 7

Breadth-first Search Using Matrices A B D A B D A B D A B D C E C E C E C E X 1 0 0 0 0 * * * 0 0 * * * * 0 * * * * * A B C D E A B C D E A B C D E A B C D E A 1 1 1 0 0 B 0 1 0 1 0 C 0 1 1 0 0 D 0 0 0 1 1 G E 0 0 0 0 1 G = adjacency matrix X = BFS vector Simplified algorithm: repeat { X = G*X } 6/29/2012 8

Breadth-first Search Using Matrices 1 0 0 0 0 * * * 0 0 A B C D E A B C D E Matrix operations A 1 1 1 0 0 B 0 1 0 1 0 C 0 1 1 0 0 D 0 0 0 1 1 * * * * 0 A B C D E * * * * * A B C D E Easy to express Efficient to implement E 0 0 0 0 1 G = adjacency matrix X = BFS vector Simplified algorithm: repeat { X = G*X } 6/29/2012 9

Linear Algebra on Existing Frameworks Matrix Operations: Structured, coarse grained Need global state 6/29/2012 10

Linear Algebra on Existing Frameworks Matrix Operations: Structured, coarse grained Need global state Data-parallel frameworks MapReduce/Dryad Process each record in parallel Use case: Computing sufficient statistics 6/29/2012 11

Linear Algebra on Existing Frameworks Matrix Operations: Structured, coarse grained Need global state Data-parallel frameworks MapReduce/Dryad Process each record in parallel Use case: Computing sufficient statistics Graph-centric frameworks Pregel/GraphLab Process each vertex in parallel Use case: Graph models 6/29/2012 12

Challenge 1 Sparse Matrices 6/29/2012 13

Challenge 1 Sparse Matrices 6/29/2012 14

Challenge 1 Sparse Matrices 6/29/2012 15

Block density (normalized ) Challenge 1 Sparse Matrices 10000 LiveJournal Netflix ClueWeb-1B 1000 100 10 1 1 11 21 31 41 51 61 71 81 91 Block ID 6/29/2012 16

Block density (normalized ) Challenge 1 Sparse Matrices 10000 LiveJournal Netflix ClueWeb-1B 1000 100 10 1 1 11 21 31 41 51 61 71 81 91 Block ID 6/29/2012 17

Block density (normalized ) Challenge 1 Sparse Matrices 10000 LiveJournal Netflix ClueWeb-1B 1000 1000x more elements Computation imbalance 100 10 1 1 11 21 31 41 51 61 71 81 91 Block ID 6/29/2012 18

Challenge 2 Incremental Updates New movie ratings Refine recommendations Better suggestions Incremental computation on consistent view of data 6/29/2012 19

Presto Framework for large-scale iterative linear algebra Extend R for scalability and incremental updates 6/29/2012 20

Outline Motivation Programming model Design Applications and Results 6/29/2012 21

Programming Model One data structure: Distributed Array A darray() 6/29/2012 22

Programming Model Iteration: foreach 6/29/2012 23

Programming Model Iteration: foreach Compute Compute Compute Compute 6/29/2012 24

Programming Model Incremental updates: onchange, update Compute Compute Compute Compute 6/29/2012 25

Programming Model Incremental updates: onchange, update Data Updated Compute Compute Compute Compute 6/29/2012 26

Programming Model Incremental updates: onchange, update Data Updated Compute Compute Compute Compute 6/29/2012 27

PageRank Using Presto N s P 1 P 1 s P 1 P 1 N P M P_old Z M darray(dim=c(n,n),blocks=(s,n)) P darray(dim=c(n,1),blocks=(s,1)) while(..){ foreach(i,1:len, calculate(p=splits(p,i),m=splits(m,i), x=splits(p_old),z=splits(z,i)) { p (m*x)+z } ) } P_old P 6/29/2012 28

PageRank Using Presto N s P 1 P 1 s P 1 P 1 N P M P_old Z M darray(dim=c(n,n),blocks=(s,n)) P darray(dim=c(n,1),blocks=(s,1)) while(..){ foreach(i,1:len, calculate(p=splits(p,i),m=splits(m,i), Create Distributed Array x=splits(p_old),z=splits(z,i)) { p (m*x)+z } ) } P_old P 6/29/2012 29

PageRank Using Presto N s P 1 P 1 s P 1 P 1 N P M P_old Z M darray(dim=c(n,n),blocks=(s,n)) P darray(dim=c(n,1),blocks=(s,1)) while(..){ foreach(i,1:len, calculate(p=splits(p,i), m=splits(m,i), x=splits(p_old), z=splits(z,i)) { p (m*x)+z } ) } P_old P 6/29/2012 30

PageRank Using Presto N s P 1 P 1 s P 1 P 1 N P M P_old Z M darray(dim=c(n,n),blocks=(s,n)) P darray(dim=c(n,1),blocks=(s,1)) while(..){ Execute function in a cluster foreach(i,1:len, calculate(p=splits(p,i), m=splits(m,i), x=splits(p_old), z=splits(z,i)) { p (m*x)+z } ) } P_old P 6/29/2012 31

PageRank Using Presto N s P 1 P 1 s P 1 P 1 N P M P_old Z M darray(dim=c(n,n),blocks=(s,n)) P darray(dim=c(n,1),blocks=(s,1)) while(..){ Execute function in a cluster foreach(i,1:len, calculate(p=splits(p,i), m=splits(m,i), x=splits(p_old), z=splits(z,i)) { p (m*x)+z } Pass array partitions ) } P_old P 6/29/2012 32

Incremental PageRank N s P 1 P 1 s P 1 P 1 N P M P_old Z M darray(dim=c(n,n),blocks=(s,n)) P darray(dim=c(n,1),blocks=(s1)) onchange(m) { while(..){ foreach(i,1:len, calculate(p=splits(p,i), m=splits(m,i), x=splits(p_old), z=splits(z,i)) { p (m*x)+z update(p) }) P_old P 6/29/2012 33 }}

Incremental PageRank N s P 1 P 1 s P 1 P 1 N P M P_old Z M darray(dim=c(n,n),blocks=(s,n)) P darray(dim=c(n,1),blocks=(s1)) onchange(m) { while(..){ foreach(i,1:len, calculate(p=splits(p,i), m=splits(m,i), x=splits(p_old), z=splits(z,i)) { p (m*x)+z update(p) }) P_old P }} Execute when data changes 6/29/2012 34

Incremental PageRank N s P 1 P 1 s P 1 P 1 N P M P_old Z M darray(dim=c(n,n),blocks=(s,n)) P darray(dim=c(n,1),blocks=(s1)) onchange(m) { while(..){ foreach(i,1:len, calculate(p=splits(p,i), m=splits(m,i), x=splits(p_old), z=splits(z,i)) { p (m*x)+z update(p) }) P_old P }} Execute when data changes Update page rank vector 6/29/2012 35

Outline Motivation Programming model Design Applications and Results 6/29/2012 36

Dynamic Partitioning of Matrices 6/29/2012 37

Dynamic Partitioning of Matrices Profile execution 6/29/2012 38

Dynamic Partitioning of Matrices Profile execution 6/29/2012 39

Dynamic Partitioning of Matrices Profile execution Partition 6/29/2012 40

Dynamic Partitioning of Matrices Profile execution Partition 6/29/2012 41

Dynamic Partitioning of Matrices Profile execution Partition 6/29/2012 42

Dynamic Partitioning of Matrices Profile execution Partition Programmers specify size invariants. 6/29/2012 43

Dynamic Partitioning of Matrices Up to 2x performance improvement 6/29/2012 44

Incremental Updates Using Consistent Snapshots Web Graph Adjacency Matrix R P Q S 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 6/29/2012 45

Incremental Updates Using Consistent Snapshots Web Graph Adjacency Matrix R P Q S 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 onchange(m 1 ) 6/29/2012 46

Incremental Updates Using Consistent Snapshots Web Graph Adjacency Matrix Page Rank R P Q S 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 0.035 0.006 0.008 0.032 update P 1 6/29/2012 47

Incremental Updates Using Consistent Snapshots Web Graph Adjacency Matrix Page Rank R P Q S 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 0.035 0.006 0.008 0.032 6/29/2012 48

Incremental Updates Using Consistent Snapshots Web Graph Adjacency Matrix Page Rank R P Q S 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 0.035 0.006 0.008 0.032 onchange(m 2 ) 0 1 1 1 0 0 0 0 1 0 0 0 1 0 0 0 6/29/2012 49

Incremental Updates Using Consistent Snapshots Web Graph P R S Q Adjacency Matrix 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 update Page Rank 0.035 0.006 0.008 0.032 0 1 1 1 0.035 0 0 0 0 0.028 1 0 0 0 0.008 1 0 0 0 0.032 6/29/2012 50

Versioned Distributed Arrays Mechanics of versioning update: Increment version number onchange: Bind a version number for the array before executing the handler 6/29/2012 51

Outline Motivation Programming model Design Applications and Results 6/29/2012 52

Applications Implemented in Presto Application Algorithm Presto LOC PageRank Eigenvector calculation 41 Triangle counting Top-K eigenvalues 121 Netflix recommendation Matrix factorization 130 Centrality measure Graph algorithm 132 k-path connectivity Graph algorithm 30 k-means Clustering 71 Sequence alignment Smith-Waterman 64 6/29/2012 53

Applications Implemented in Presto Application Algorithm Presto LOC PageRank Eigenvector calculation 41 Triangle counting Top-K eigenvalues 121 Fewer than 140 lines of code Netflix recommendation Matrix factorization 130 Centrality measure Graph algorithm 132 k-path connectivity Graph algorithm 30 k-means Clustering 71 Sequence alignment Smith-Waterman 64 6/29/2012 54

Time (sec) Presto is Fast! PageRank per-iteration execution time 1000 Presto Hadoop-InMem 100 10 1 8 16 32 64 Number of workers Data: 100M nodes, 1.2B edges. Setup: 10G network. 12 cores, 96GB RAM. 6/29/2012 55

Time (sec) Presto is Fast! PageRank per-iteration execution time 1000 Presto Hadoop-InMem 100 More than 20x faster than Hadoop (w/ in-memory storage) 10 1 8 16 32 64 Number of workers Data: 100M nodes, 1.2B edges. Setup: 10G network. 12 cores, 96GB RAM. 6/29/2012 56

More in the Paper Memory management, caching of partitions Scheduling operations Storage driver interface to HBase Fault tolerance 6/29/2012 57

Conclusion Linear Algebra is a powerful abstraction Easily express machine learning, graph algorithms Challenges: Sparse matrices, Incremental data Presto prototype extends R Open source version soon! 6/29/2012 58