Impression Store: Compressive Sensing-based Storage for Big Data Analytics


Impression Store: Compressive Sensing-based Storage for Big Data Analytics
Jiaxing Zhang, Ying Yan, Liang Jeff Chen, Minjie Wang, Thomas Moscibroda & Zheng Zhang, Microsoft Research

The Curse of O(N) in the Big Data Era
- In the old days, an O(N) algorithm was efficient. But what if N is increasing fast?
- Parallelism is only a partial solution: O(N) → O(N/k), where k is the number of machines.
- It is an illusion that we can compute against all the data: what we collect is always a sample, and by the time we finish computing, new data has already been generated.
- Approximate results suffice!

Impression Store
- Provides an abstraction of big data vectors.
- Supports retrieval of the big components of the data.
- Stores impression information rather than raw data.
- Improves performance: saves storage capacity and IO bandwidth.
- High scalability.

System Design
[Diagram: updates Δx_1 ... Δx_L arrive at nodes 1 ... L; each node stores the compressed form f(Δx_i); update synchronization across nodes is eventually consistent; top-K/outlier-K queries can be issued to any node.]
Key technique: compressive sensing.
1. High scalability: any node can process any update or query.
2. Efficient in storage, memory, IO and communication cost: incremental updates are applied in the compressed domain, and only the big components are ever uncompressed.
3. High throughput for impression queries: top-K, outlier-K and mode.

Introduction to Compressive Sensing
A sparse data vector x = (x_1, ..., x_N) of length N is compressed into a measurement vector y = (y_1, ..., y_M) of length M (M << N) by a random projection: y = Φx, where Φ is a random M×N matrix. Decompression recovers the data vector from y using a recovery algorithm such as Orthogonal Matching Pursuit (OMP).
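
To make the measurement/recovery pipeline concrete, here is a minimal NumPy sketch: a random Gaussian Φ compresses a k-sparse vector, and a plain OMP loop recovers it. The sizes, the Gaussian ensemble, and this simple OMP variant are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def omp(Phi, y, k):
    """Plain Orthogonal Matching Pursuit: recover a k-sparse x from y = Phi @ x."""
    M, N = Phi.shape
    support, coef = [], np.zeros(0)
    residual = y.copy()
    for _ in range(k):
        # Greedily pick the column most correlated with the residual.
        j = int(np.argmax(np.abs(Phi.T @ residual)))
        if j not in support:
            support.append(j)
        # Re-fit by least squares on the current support, update the residual.
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
    x_hat = np.zeros(N)
    x_hat[support] = coef
    return x_hat

# Toy instance: a length-N vector with k big components, M << N measurements.
rng = np.random.default_rng(0)
N, M, k = 1000, 100, 5
x = np.zeros(N)
x[rng.choice(N, k, replace=False)] = rng.normal(100, 10, k)  # big components
Phi = rng.normal(0, 1 / np.sqrt(M), (M, N))                  # random projection
y = Phi @ x                                                  # compression: length M
print(np.allclose(omp(Phi, y, k), x))                        # exact w.h.p.: True
```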

Compressive Sensing vs. Compression
The key property is that the compression is decomposable: if x = x_1 + x_2, then y = Φx = Φ(x_1 + x_2) = Φx_1 + Φx_2 = y_1 + y_2. This enables:
- Distributed aggregation: node 1 computes y_1 = Φx_1 and node 2 computes y_2 = Φx_2; their sum y = y_1 + y_2 is the measurement of the aggregate data x_1 + x_2.
- Continuously updated data: with base data x_0 and an update Δx, y_1 = Φx_0 and y_2 = ΦΔx give y = y_1 + y_2 as the measurement of x_0 + Δx.
Recovery uses OMP or our BOMP; big components are recovered with more precision.
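
The decomposability on this slide is just linearity of the projection; the sketch below checks both uses (distributed aggregation and incremental updates) numerically. All sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 1000, 100
Phi = rng.normal(0, 1 / np.sqrt(M), (M, N))

# Distributed aggregation: two nodes compress independently; adding the
# M-length measurements equals compressing the aggregated N-length vector.
x1, x2 = rng.normal(size=N), rng.normal(size=N)
y1, y2 = Phi @ x1, Phi @ x2
assert np.allclose(y1 + y2, Phi @ (x1 + x2))

# Continuously updated data: keep the measurement of the base data x0 and
# fold in each delta with one O(M) addition, never touching x0 itself.
x0, dx = rng.normal(size=N), rng.normal(size=N)
y = Phi @ x0
y += Phi @ dx                       # update entirely in the compressed domain
assert np.allclose(y, Phi @ (x0 + dx))
print("decomposability holds")
```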

Architecture
[Diagram: clients issue updates and queries through the Impression Store API to any node. Node i accumulates its updates into Δx_i and compresses them into the measurement Δy_i = ΦΔx_i. Update synchronization propagates partial sums of the Δy_i so that each node's state converges to the oracle measurement y = Σ_{i=1}^{L} Δy_i, from which top-K/outlier-K queries are answered by recovery.]

Client
The client library:
1. Maps a table into data vectors.
2. Translates SQL into operations on vectors.
[Same architecture diagram as above.]
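
As a rough illustration of this client-side mapping (the slide gives no details, so the schema, names, and encoding below are all assumptions): each distinct group-by key gets a vector index and the measure becomes the component value, so a SQL top-K over the grouped table turns into Top(K) against the store.

```python
# Hypothetical mapping of a grouped table onto a data vector. A query such as
#   SELECT TOP 3 Market, Vertical, SUM(Revenue) ... GROUP BY Market, Vertical
# then translates into Top(3) against the vector. Schema and data are made up.
rows = [("US", "Retail", 120.0), ("UK", "Travel", 80.0),
        ("US", "Retail", 30.0), ("DE", "Auto", 55.0)]

index_of = {}        # (Market, Vertical) -> vector index
updates = []         # (index, delta) pairs to send as Update(i, x) calls
for market, vertical, revenue in rows:
    i = index_of.setdefault((market, vertical), len(index_of))
    updates.append((i, revenue))

print(index_of)      # {('US', 'Retail'): 0, ('UK', 'Travel'): 1, ('DE', 'Auto'): 2}
print(updates)       # [(0, 120.0), (1, 80.0), (0, 30.0), (2, 55.0)]
```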

Architecture: API
- Update(i, x): updates component i of the vector with value x.
- Top(K): returns the top-K components.
- Outlier(K): returns the outlier-K components and the mode.
[Same architecture diagram as above.]
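
To pin down the query semantics, here is a toy, exact in-memory stand-in for the three calls named on the slide. The real store holds only compressed measurements per node, and the outlier/mode definitions below are a plausible reading rather than the paper's exact ones.

```python
import numpy as np

class ToyImpressionStore:
    """Exact (uncompressed) reference for the API semantics on the slide."""

    def __init__(self, n):
        self.x = np.zeros(n)

    def update(self, i, x):
        # Update(i, x): accumulate a delta into component i.
        self.x[i] += x

    def top(self, k):
        # Top(K): the K largest components as (index, value) pairs.
        idx = np.argsort(self.x)[::-1][:k]
        return [(int(i), float(self.x[i])) for i in idx]

    def outlier(self, k):
        # Outlier(K): the mode plus the K components farthest from it.
        vals, counts = np.unique(self.x, return_counts=True)
        mode = float(vals[np.argmax(counts)])
        idx = np.argsort(np.abs(self.x - mode))[::-1][:k]
        return [(int(i), float(self.x[i])) for i in idx], mode

store = ToyImpressionStore(10)
store.update(3, 100.0)
store.update(7, 42.0)
print(store.top(2))       # [(3, 100.0), (7, 42.0)]
print(store.outlier(2))   # ([(3, 100.0), (7, 42.0)], 0.0)
```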

Query Processing
Each node continuously works on three tasks:
1. Aggregating and compressing data updates.
2. Update synchronization, at O(M) cost.
3. Top/outlier-K recovery: matrix computation, offloaded to the GPU.
[Same architecture diagram as above.]

Update Synchronization
Goal: each node's y_i converges quickly to the oracle measurement y = Σ_{i=1}^{L} Δy_i. Updates are randomly issued to any node.
[Same architecture diagram as above.]

Update Synchronization: Policy
Nodes exchange partial sums ψ over a loop-free topology. At node l, with neighbor set N(l) and ψ_{q→l} the partial sum last received from neighbor q:
  y = Δy + Σ_{q∈N(l)} ψ_{q→l}
and what l sends to a neighbor p excludes p's own contribution:
  ψ_{l→p} = Δy + Σ_{q∈N(l), q≠p} ψ_{q→l} = y − ψ_{p→l}.
Topology trade-off: a master-slave tree structure has small latency but unbalanced load; a line structure has long latency but balanced load; topologies in between trade off the two.
Each send-receive pair may not hold identical copies at all times, but the policy is proved to achieve eventual consistency.
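
Below is a toy simulation of this policy; the line topology, sizes, and synchronous round structure are illustrative, while the paper's asynchronous protocol is more general. Each node keeps its local Δy and the last partial sum received from each neighbor, estimates y locally, and sends each neighbor everything except that neighbor's own contribution.

```python
import numpy as np

rng = np.random.default_rng(2)
L, M = 5, 8
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]   # a line: loop-free, load-balanced
nbrs = {l: [] for l in range(L)}
for a, b in edges:
    nbrs[a].append(b)
    nbrs[b].append(a)

dy = [rng.normal(size=M) for _ in range(L)]                     # local deltas
psi = {(q, l): np.zeros(M) for l in range(L) for q in nbrs[l]}  # last received

def estimate(l):
    # y at node l: own delta plus all received partial sums.
    return dy[l] + sum(psi[(q, l)] for q in nbrs[l])

for _ in range(L):                          # a few gossip rounds
    for l in range(L):
        for p in nbrs[l]:
            # Send p everything except p's own contribution: psi_{l->p} = y - psi_{p->l}.
            psi[(l, p)] = estimate(l) - psi[(p, l)]

oracle = sum(dy)                            # y = sum of all deltas
print(all(np.allclose(estimate(l), oracle) for l in range(L)))  # True
```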

Optimizations
- Speeding up the recovery, which costs O(M²N): GPU offload yields a 30~40x speed-up.
- For continuous updates: optimize the recovery algorithm by keeping the support positions from the last recovery, reducing the complexity to O(M³).
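
One way to read the continuous-update optimization is sketched below, under the assumption that the support changes little between recoveries; the paper's exact algorithm may differ. OMP is seeded with the support found last time, so a recovery after a small update mostly re-solves least squares on a known support instead of rebuilding it column by column.

```python
import numpy as np

def omp_warm(Phi, y, k, support=()):
    """OMP seeded with a previous support; falls back to plain OMP if empty."""
    N = Phi.shape[1]
    support = list(dict.fromkeys(support))          # keep order, drop dupes
    while len(support) < k:
        coef = (np.linalg.lstsq(Phi[:, support], y, rcond=None)[0]
                if support else np.zeros(0))
        residual = y - Phi[:, support] @ coef
        j = int(np.argmax(np.abs(Phi.T @ residual)))
        if j in support:
            break                                   # no new atom to add
        support.append(j)
    coef = np.linalg.lstsq(Phi[:, support], y, rcond=None)[0]
    x_hat = np.zeros(N)
    x_hat[support] = coef
    return x_hat, support

rng = np.random.default_rng(4)
N, M, k = 1000, 100, 5
idx = rng.choice(N, k, replace=False)
x = np.zeros(N)
x[idx] = rng.normal(100, 10, k)
Phi = rng.normal(0, 1 / np.sqrt(M), (M, N))

_, supp = omp_warm(Phi, Phi @ x, k)                 # cold recovery
x[idx[0]] += 5.0                                    # small update, same support
x_hat, _ = omp_warm(Phi, Phi @ x, k, support=supp)  # warm: skips the atom search
print(np.allclose(x_hat, x))                        # True
```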

Experiment Setup
Error metrics, on an example with K = 5:

  Ground truth      Approximation
  Key  Value        Key  Value
  1    100          1    100
  2    80           2    80
  3    50           3    48
  4    30           4    25
  5    10           6    11

Position error: E_p = 1 − 4/5 = 20% (four of the five true top keys are found). Value error: E_v = 3.88%.
Workload: revenue on Ads entries in the Bing search engine, grouped by 6 attributes (Market, Vertical, QueryClass, ...). In total there are N = 12,891 user-interested entries in the vector.
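
A quick check of the position-error number on the slide (the exact formula for E_v is not spelled out here, so only E_p is reproduced):

```python
# Ground truth and approximate top-5 from the slide's example.
truth = {1: 100, 2: 80, 3: 50, 4: 30, 5: 10}
approx = {1: 100, 2: 80, 3: 48, 4: 25, 6: 11}

# E_p: fraction of true top-K positions missed; key 5 is missed, key 6 is spurious.
e_p = 1 - len(truth.keys() & approx.keys()) / len(truth)
print(f"E_p = {e_p:.0%}")   # E_p = 20%
```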

Preliminary Results (1)
Effect of M and N on recovery quality.
[Plots: E_p and E_v versus M and N.]

Preliminary Results (2)
Bigger values can be recovered much more accurately with a smaller M.

Preliminary Results (3)
Comparison with the traditional top-K-only approach (K = M):
- Traditional top-K: each node keeps only the local top-K of its share x_1 or x_2; merging the two local top-Ks approximates the top-K of x_1 + x_2, but an entry that is large globally yet outside every local top-K can be lost.
- Compressive sensing: each node keeps y_1 = Φx_1 and y_2 = Φx_2; merging is just y = y_1 + y_2, and the recovery algorithm approximates x_1 + x_2 from y.
[Diagram: the two pipelines side by side.]
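
The sketch below illustrates why measurement merging can beat merging local top-Ks; the node data is contrived to make the failure obvious. An entry that is globally largest but not a local winner disappears from the merged top-K, while y_1 + y_2 still encodes it.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 1000
x1, x2 = np.zeros(N), np.zeros(N)
x1[0], x2[0] = 60.0, 60.0     # index 0: globally largest (120) but never a local top-1
x1[1], x2[2] = 100.0, 100.0   # each node's local winner

def local_topk(x, k):
    # Keep only the k largest entries, zeroing the rest (what each node ships).
    out = np.zeros_like(x)
    idx = np.argsort(x)[::-1][:k]
    out[idx] = x[idx]
    return out

merged = local_topk(x1, 1) + local_topk(x2, 1)
print(int(np.argmax(merged)), int(np.argmax(x1 + x2)))  # 1 vs 0: index 0 is lost

# With compressive sensing, nodes ship y1 = Phi @ x1 and y2 = Phi @ x2 instead;
# y1 + y2 = Phi @ (x1 + x2), so recovery can still surface index 0.
M = 100
Phi = rng.normal(0, 1 / np.sqrt(M), (M, N))
assert np.allclose(Phi @ x1 + Phi @ x2, Phi @ (x1 + x2))
```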

Ongoing and Future Work
- Support more sophisticated queries.
- Explore combining compressive sensing with other techniques.
- Work together with sampling.
- Multiple parallel queries to different nodes can improve confidence.

Thanks!