Data Canopy. Accelerating Exploratory Statistical Analysis. Abdul Wasay Xinding Wei Niv Dayan Stratos Idreos
1 Accelerating Exploratory Statistical Analysis. Abdul Wasay, Xinding Wei, Niv Dayan, Stratos Idreos
2 Statistics are everywhere! Algorithms Systems Analytic Pipelines
3 (plot: temperature, May 2017)
4 (plot: temperature, May 2017) Mean
5 (plot: temperature, May 2017) Variance
6 Correlation with temperature: lemonade sales (+), hot chocolate sales (-)
7 Repetitive Statistics
8 Repetition takes multiple forms: sub-range (figure: queries Q1, Q2, Q3 over time, each with a query range on a column)
9 Repetition takes multiple forms: overlap (figure: queries Q1, Q2, Q3 over time, each with a query range on a column)
10 Repetition takes multiple forms: different statistics (figure: queries Q1, Q2, Q3 computing statistics S1, S2, S3 on a column)
11 Repetition takes multiple forms: mixed (figure: queries Q1, Q2, Q3 computing statistics S1, S2, S3 on a column)
12 Exploratory Workloads Exhibit Repetition (table: queries (%) with column set repeats*, template repeats*, exactly repeats*; workloads SQLShare and SDSS; surviving values: 4.00). * at least once. Source: SQLShare: Results from a Multi-Year SQL-as-a-Service Experiment. Shrainik Jain, Dominik Moritz, Bill Howe, Ed Lazowska. SIGMOD 2016
13 Exploratory Workloads Exhibit Repetition. Repetition is everywhere: between 50% and 99% (same table; surviving values: 54.65, 4.00)
14 How Do Existing Tools Perform? NumPy (Python), Modeltools (R), MonetDB
15 How Do Existing Tools Perform? NumPy (Python), Modeltools (R), MonetDB. Sequence of queries: Q1 Mean, Q2 Var., Q3 Cov., Q4 Mean, Q5 Var., Q6 Cov.
17 How Do Existing Tools Perform? (plot: normalized execution time of NumPy, Modeltools, and MonetDB for Q1 Mean, Q2 Var., Q3 Cov., Q4 Mean, Q5 Var., Q6 Cov.)
19 Data
20 Existing systems always compute statistics from scratch Data
21
22 Statistical queries; library of building blocks; data
25 Avoid redundant data access to accelerate statistical analysis. Statistical queries; library of building blocks; data
26 Statistic; basic aggregates; data divided into chunks
27 Q: Monthly Variance. $\mathrm{Variance} = \frac{1}{n}\sum_i x_i^2 - \left(\frac{1}{n}\sum_i x_i\right)^2$. Chunk size: 7 (a week)
28 Q: Monthly Variance. Chunk size: 7 (a week)
29 Q: Monthly Variance. Chunk size: 7 (a week)
30 Var (first week): $\frac{1}{7}\sum_i x_i^2 - \left(\frac{1}{7}\sum_i x_i\right)^2$. Monthly mean: $\frac{1}{31}\sum_i x_i$. Mean (first week): $\frac{1}{7}\sum_i x_i$. Forms of reuse: between ranges, between statistics, mixed. Chunk size: 7 (a week)
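The reuse shown on these slides can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `ChunkSums`, `summarize`, and `combine` are hypothetical, and the temperature values are made up. Each weekly chunk stores only its count, $\sum x$, and $\sum x^2$, and both the mean and the variance of any set of whole chunks are then derived from those stored sums without touching the data again.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkSums:
    """Per-chunk building blocks: count, sum of x, sum of x squared."""
    count: int
    total: float
    total_sq: float


def summarize(values, chunk_size=7):
    """Scan the data once and keep one ChunkSums per chunk (a week here)."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        chunks.append(ChunkSums(len(chunk), sum(chunk), sum(v * v for v in chunk)))
    return chunks


def combine(chunks):
    """Merge the sums of several chunks into the sums of their union."""
    return ChunkSums(
        sum(c.count for c in chunks),
        sum(c.total for c in chunks),
        sum(c.total_sq for c in chunks),
    )


def mean(s):
    return s.total / s.count


def variance(s):
    # Same formula as on the slide: (1/n) * sum(x^2) - ((1/n) * sum(x))^2
    return s.total_sq / s.count - mean(s) ** 2


# Made-up daily temperatures for a 31-day month.
temperatures = [float(60 + (day * 7) % 23) for day in range(31)]
weekly = summarize(temperatures)

monthly_variance = variance(combine(weekly))  # the original query
first_week_variance = variance(weekly[0])     # reuse between ranges
monthly_mean = mean(combine(weekly))          # reuse between statistics
first_week_mean = mean(weekly[0])             # mixed
```

Note that this textbook one-pass variance formula can lose precision when the mean is large relative to the spread; the sketch keeps it because it is the form shown on the slide.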
34 Open questions: Decompose statistics into basic aggregates? Synthesize? Store? Chunk size? Updates? Build? Memory pressure? (figure: statistic, basic aggregates, data divided into chunks)
36 Questions covered next: Decompose statistics into basic aggregates? Synthesize? Store? Chunk size? (figure: statistic, basic aggregates, data divided into chunks)
37 Statistic Basic Aggregates
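38 A basic aggregate. Data column
39 A basic aggregate: transformation τ over a data column
40 A basic aggregate: aggregate function f over transformation τ of a data column
41 A basic aggregate: aggregate function f over transformation τ of a data column. Examples of τ: $\tau(x) = x$, $\tau(x) = x^2$
42 A basic aggregate: aggregate function f over transformation τ of a data column, with f applied to each part of the column: f(·), f(·), f(·), f(·)
43 A basic aggregate: aggregate function f over transformation τ of a data column, with f applied to each part of the column: f(·), f(·), f(·), f(·). Examples of f: max, min, sum, product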
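44 Example: $f(X) = \sum_i x_i$, $\tau(x) = x^2$
45 Examples: $f(X) = \sum_i x_i$, $\tau(x) = x^2$; $f(X) = \max\{x_i\}$, $\tau(x) = x$
46 Statistics are decomposed into building blocks: statistic function F, aggregate function f, transformation τ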
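47 Arithmetic Mean: $\frac{1}{n}\sum_i x_i$, using $\tau(x) = x$ for the sum and $\tau(x) = 1$ for the count $n$
48 Geometric Mean: $\left(\prod_i x_i\right)^{1/n}$, using $\tau(x) = x$ for the product and $\tau(x) = 1$ for the count $n$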
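The decomposition on these slides (a statistic function F over basic aggregates, each an aggregate function f applied to a transformation τ of the column) can be illustrated as follows. This is a sketch under assumptions: `basic_aggregate`, `arithmetic_mean`, and `geometric_mean` are hypothetical names chosen for illustration, and the count n is obtained as a sum over τ(x) = 1, following the reading of the slides above.

```python
import math


def basic_aggregate(f, tau, column):
    """Apply aggregate function f to the transformed values tau(x)."""
    return f(tau(x) for x in column)


def identity(x):
    return x


def one(x):
    return 1


def arithmetic_mean(column):
    # F = sum(x) / n, where n = sum(1)
    total = basic_aggregate(sum, identity, column)
    n = basic_aggregate(sum, one, column)
    return total / n


def geometric_mean(column):
    # F = (product(x)) ** (1 / n), where n = sum(1)
    product = basic_aggregate(math.prod, identity, column)
    n = basic_aggregate(sum, one, column)
    return product ** (1.0 / n)
```

Both statistics share the count aggregate, which is the kind of reuse between statistics that the deck argues for.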
49 Aggregate function f over transformation τ of a data column
50 Aggregate function f applied per chunk over transformation τ of a data column
51 Basic aggregates and chunks: statistic function F; aggregate function f; transformation τ; data column divided into chunks; one basic aggregate per chunk
52 Basic aggregates and chunks (as on slide 51). Reuse across: statistic type
53 Basic aggregates and chunks (as on slide 51). Reuse across: statistic type, overlapping ranges, sub-ranges
54 Basic aggregates and chunks (as on slide 51). Reuse across: statistic type, overlapping ranges, sub-ranges, mixed
55 Roadmap. Covered: decomposing statistics into basic aggregates. Open: Synthesize? Store? Chunk size?
57 Basic aggregates: aggregate function f applied per chunk over transformation τ of a data column
58 Basic aggregates are stored in segment trees
59 Data divided into logical chunks
60 Basic aggregates over the logical chunks form the leaves
61 (figure: segment tree built on top of the leaves, shown over slides 61 to 64)
65
66 One segment tree per basic aggregate per column (figure: columns a and b)
67 One segment tree per basic aggregate per column (figure: columns a and b, each with a further tree for its squared values)
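A compact way to picture "one segment tree per basic aggregate per column" is an array-backed segment tree whose leaves are per-chunk basic aggregates and whose internal nodes apply the aggregate function to their two children. The sketch below is illustrative only: the class `SegmentTree`, the helper `chunk_leaves`, the toy columns, and the aggregate names `sum_x` and `sum_x2` are all assumptions, and the iterative layout is a standard textbook variant rather than anything the slides specify. It assumes a commutative aggregate function such as sum, max, min, or product.

```python
import operator


class SegmentTree:
    """Array-backed segment tree over per-chunk basic aggregates."""

    def __init__(self, leaves, f):
        self.n = len(leaves)
        self.f = f  # binary, commutative aggregate function
        self.tree = [None] * self.n + list(leaves)
        # Each internal node is f applied to its two children.
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = f(self.tree[2 * i], self.tree[2 * i + 1])

    def query(self, lo, hi):
        """Aggregate over chunks lo..hi-1; returns None for an empty range."""
        result = None
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:
                result = self.tree[lo] if result is None else self.f(result, self.tree[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                result = self.tree[hi] if result is None else self.f(result, self.tree[hi])
            lo >>= 1
            hi >>= 1
        return result


def chunk_leaves(column, chunk_size, tau):
    """One basic aggregate (here a sum of tau(x)) per chunk."""
    return [
        sum(tau(x) for x in column[start:start + chunk_size])
        for start in range(0, len(column), chunk_size)
    ]


# Toy data: two columns, two basic aggregates each, chunk size 3.
columns = {"a": list(range(1, 13)), "b": [2 * v for v in range(1, 13)]}
transformations = {"sum_x": lambda x: x, "sum_x2": lambda x: x * x}

trees = {
    (name, aggregate): SegmentTree(chunk_leaves(values, 3, tau), operator.add)
    for name, values in columns.items()
    for aggregate, tau in transformations.items()
}
```

For example, `trees[("a", "sum_x")].query(1, 3)` combines the second and third chunks of column `a` without rescanning the column.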
69 Roadmap. Covered: decomposing statistics into basic aggregates; storing them. Open: Synthesize? Chunk size?
71 Calculate variance in temperature between May 15 and Oct. 15
72 Calculate variance in temperature between May 15 and Oct. 15. Statistic: variance. Column: temperature. Range: May 15 to Oct. 15
73 Calculate variance in temperature between May 15 and Oct. 15. Recipe for variance: $\sum x^2$, $\sum x$, $N$
74 Calculate variance in temperature between May 15 and Oct. 15. Chunks: May, June, July, August, Sept., Oct. Partial months are read from base data, whole months from segment trees. Chunk size = 1 month
75 Calculate variance in temperature between May 15 and Oct. 15. Plan: base data for the partial months at both ends, segment trees for the whole months in between. Chunk size = 1 month
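The plan on this slide splits the query range into two residual pieces scanned from base data and a run of whole chunks answered from stored aggregates. A minimal sketch of that split follows, with several stated assumptions: chunks are fixed-size runs of rows (30 rows as a stand-in for "1 month"), the function names `plan` and `range_variance` are hypothetical, and the whole-chunk part sums a plain list of per-chunk aggregates where the deck would use a segment tree lookup.

```python
def plan(lo, hi, chunk_size):
    """Split row range [lo, hi) into (left scan, whole chunks, right scan)."""
    first_full = -(-lo // chunk_size)  # first chunk starting at or after lo
    last_full = hi // chunk_size       # one past the last chunk ending by hi
    if first_full >= last_full:
        # No whole chunk inside the range: scan all of it from base data.
        return (lo, hi), (0, 0), (hi, hi)
    return (
        (lo, first_full * chunk_size),
        (first_full, last_full),
        (last_full * chunk_size, hi),
    )


def range_variance(column, chunk_sums, chunk_sq_sums, lo, hi, chunk_size):
    """Variance over rows [lo, hi) using the recipe sum(x^2), sum(x), N."""
    left, (c_lo, c_hi), right = plan(lo, hi, chunk_size)
    residual = column[left[0]:left[1]] + column[right[0]:right[1]]
    # Whole chunks come from stored aggregates (a segment tree in the deck).
    total = sum(residual) + sum(chunk_sums[c_lo:c_hi])
    total_sq = sum(x * x for x in residual) + sum(chunk_sq_sums[c_lo:c_hi])
    n = hi - lo
    return total_sq / n - (total / n) ** 2


# Made-up column of 180 rows with chunk size 30.
CHUNK = 30
column = [float((i * 13) % 37) for i in range(180)]
chunk_sums = [sum(column[s:s + CHUNK]) for s in range(0, len(column), CHUNK)]
chunk_sq_sums = [
    sum(x * x for x in column[s:s + CHUNK]) for s in range(0, len(column), CHUNK)
]

result = range_variance(column, chunk_sums, chunk_sq_sums, 14, 167, CHUNK)
```

With these numbers, `plan(14, 167, 30)` scans rows 14 to 29 and 150 to 166 from base data and takes chunks 1 to 4 from the stored aggregates, mirroring the mid-May to mid-October example.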
76 Roadmap. Covered: decomposing statistics into basic aggregates; storing them; synthesizing statistics. Open: Chunk size?
78 Query evaluation: base data at both ends of the range, segment trees in between. Chunk size = 1 month
80 Segment tree traversal: $O(\log(n/c))$. n: column size; c: chunk size
81 Segment tree traversal: $O(\log(n/c))$. Residual range scan: $O(c)$. n: column size; c: chunk size
82 Total cost = $O(\log(n/c) + c)$. Segment tree traversal: $O(\log(n/c))$. Residual range scan: $O(c)$. n: column size; c: chunk size
83 Total cost = $O(\log(n/c) + c)$ (plot: query cost vs. chunk size). n: column size; c: chunk size
88 Total cost = $O(\log(n/c) + c)$ (plot: query cost vs. chunk size, with cache size and I/O marked on the chunk-size axis). n: column size; c: chunk size
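To see why the cost curve has a sweet spot, one can attach constants to the two terms. The following is an illustrative model only: the per-node traversal cost $a$ and the per-element scan cost $b$ are assumptions introduced here and do not appear on the slides.

$$T(c) = a \log_2\!\left(\frac{n}{c}\right) + b\,c$$

$$\frac{dT}{dc} = -\frac{a}{c \ln 2} + b = 0 \;\Longrightarrow\; c^{*} = \frac{a}{b \ln 2}$$

$$\frac{d^2 T}{dc^2} = \frac{a}{c^2 \ln 2} > 0$$

Under this model the minimum depends only on the ratio $a/b$ and not on the column size $n$: the more expensive a tree-node access is relative to scanning one element, the larger the chunk size that minimizes query cost.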
89 Experimental Analysis
90 Experimental setup. Query range distributions: Range-Uniform, Range-Zoom-in. 1 thread, in-memory, 40M rows, 100 columns, 5 statistic types
91 Modes: Online
92 Modes: Online, Offline
93 Range-Uniform workload (plot: latency (ms) over the query sequence for MonetDB, Modeltools (R), NumPy (Python), and Online)
94 Range-Uniform workload: DC is up to 2 orders of magnitude faster (plot: latency (ms) over the query sequence for MonetDB, Modeltools (R), NumPy (Python), and Online)
95 Total execution time (s) for Range-Uniform* and Range-Zoom-in* (plot: No DC, Online, Offline). * 2000 queries
96 DC benefits regardless of the exploration scenario (plot: total execution time (s) for Range-Uniform* and Range-Zoom-in*; No DC, Online, Offline)
97 Total execution time (s) for Collaborative Filtering, Bayesian Classification, and Simple Linear Regression (plot labels: Online, No DC, Individual, Sequential)
98 Data Canopy: Accelerating Exploratory Statistical Analysis. Statistics are everywhere! Repetitive statistics and data access. Data Canopy synthesizes statistics from basic ingredients
99 QUERIOSITY. Accelerate by synthesis. Provide hints. daslab.seas.harvard.edu/queriosity
100 daslab.seas.harvard.edu/ data-canopy queriosity Thank You!
101 Range-Zoom-in (plot: response time (ms) over the query sequence for MonetDB, NumPy (Python), Modeltools (R), and Online DC). 1 thread, in-memory, 40M rows, 100 columns, 5 statistic types
102 Segment Tree in Data Canopy. Root: $f(\{f(X_1), f(X_2)\}, \{f(X_3), f(X_4)\})$. Internal nodes: $f(\{f(X_1), f(X_2)\})$ and $f(\{f(X_3), f(X_4)\})$. Leaves: $f(X_1)$, $f(X_2)$, $f(X_3)$, $f(X_4)$, where $X_i$ holds the transformed values of chunk $i$. Leaves: basic aggregate on every chunk. Internal node: aggregate function applied to its two children. * $f(X) = f(\{f(X_1), f(X_2), \ldots, f(X_n)\})$
103 Operation Modes: Offline, Online, Speculative (figure: Prep / Now / Future timeline for each mode)
104 Inserts and updates. Text fragments: "... sub-tree." and "... needed by incoming queries." Extend segment trees as needed (figure: column with existing rows, new rows, and a query range; plots: response time (s) for insert vs. reconstruct with x2 rows, x2 columns, x2 both; total execution time (s) for query and update vs. point updates (%), for inserts and updates)
105 Performance under Memory Pressure (plot: average response time (ms) over the query sequence)
106 Handling Memory Pressure (plots: total execution time (s) vs. base data in memory (%); DC vs. StatSys by phase and number of rows (M))
107 Memory Feasibility (plot: in-memory bivariate statistics within 8GB; number of columns for row counts including 10M, 100M, and 1T)
108 Memory Footprint (plot: memory footprint (GB) of univariate and bivariate DC for s = s_o and s = 64kb; Max U workload)
109 Scaling with Queries (plot: average response time (ms) over the query sequence, up to 1x10^6 queries, for workloads U and Z)
110 Selecting the Chunk Size (plot: total execution time (s) vs. chunk size (bytes) for column sizes including 10M, 100M, and 1B)
Randomized Selection on the GPU Laura Monroe, Joanne Wendelberger, Sarah Michalak Los Alamos National Laboratory High Performance Graphics 2011 August 6, 2011 Top k Selection on GPU Output the top k keys
More informationRegression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)
Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features
More informationDependence. Practitioner Course: Portfolio Optimization. John Dodson. September 10, Dependence. John Dodson. Outline.
Practitioner Course: Portfolio Optimization September 10, 2008 Before we define dependence, it is useful to define Random variables X and Y are independent iff For all x, y. In particular, F (X,Y ) (x,
More informationPERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah
PERFORMANCE METRICS Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Jan. 17 th : Homework 1 release (due on Jan.
More informationDigital Logic: Boolean Algebra and Gates. Textbook Chapter 3
Digital Logic: Boolean Algebra and Gates Textbook Chapter 3 Basic Logic Gates XOR CMPE12 Summer 2009 02-2 Truth Table The most basic representation of a logic function Lists the output for all possible
More informationQuery Optimization: Exercise
Query Optimization: Exercise Session 6 Bernhard Radke November 27, 2017 Maximum Value Precedence (MVP) [1] Weighted Directed Join Graph (WDJG) Weighted Directed Join Graph (WDJG) 1000 0.05 R 1 0.005 R
More informationMETA: An Efficient Matching-Based Method for Error-Tolerant Autocompletion
: An Efficient Matching-Based Method for Error-Tolerant Autocompletion Dong Deng Guoliang Li He Wen H. V. Jagadish Jianhua Feng Department of Computer Science, Tsinghua National Laboratory for Information
More informationPredicting New Search-Query Cluster Volume
Predicting New Search-Query Cluster Volume Jacob Sisk, Cory Barr December 14, 2007 1 Problem Statement Search engines allow people to find information important to them, and search engine companies derive
More informationComplex Dynamics of Microprocessor Performances During Program Execution
Complex Dynamics of Microprocessor Performances During Program Execution Regularity, Chaos, and Others Hugues BERRY, Daniel GRACIA PÉREZ, Olivier TEMAM Alchemy, INRIA, Orsay, France www-rocq.inria.fr/
More informationECE521 W17 Tutorial 1. Renjie Liao & Min Bai
ECE521 W17 Tutorial 1 Renjie Liao & Min Bai Schedule Linear Algebra Review Matrices, vectors Basic operations Introduction to TensorFlow NumPy Computational Graphs Basic Examples Linear Algebra Review
More informationRegression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)
Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features
More informationPatent Searching using Bayesian Statistics
Patent Searching using Bayesian Statistics Willem van Hoorn, Exscientia Ltd Biovia European Forum, London, June 2017 Contents Who are we? Searching molecules in patents What can Pipeline Pilot do for you?
More informationSHIFT-SPLIT: I/O Efficient Maintenance of Wavelet-Transformed Multidimensional Data
SHIFT-SPLIT: I/O Efficient aintenance of Wavelet-Transformed ultidimensional Data ehrdad Jahangiri University of Southern California Los Angeles, CA 90089-0781 jahangir@usc.edu Dimitris Sacharidis ational
More informationBelief Update in CLG Bayesian Networks With Lazy Propagation
Belief Update in CLG Bayesian Networks With Lazy Propagation Anders L Madsen HUGIN Expert A/S Gasværksvej 5 9000 Aalborg, Denmark Anders.L.Madsen@hugin.com Abstract In recent years Bayesian networks (BNs)
More informationCSE 4201, Ch. 6. Storage Systems. Hennessy and Patterson
CSE 4201, Ch. 6 Storage Systems Hennessy and Patterson Challenge to the Disk The graveyard is full of suitors Ever heard of Bubble Memory? There are some technologies that refuse to die (silicon, copper...).
More informationConcurrent Divide-and-Conquer Library
with Petascale Electromagnetics Applications, Tech-X Corporation CScADS Workshop on Libraries and Algorithms for Petascale Applications, 07/30/2007, Snowbird, Utah Background Particle In Cell (PIC) in
More informationDatabase Design and Normalization
Database Design and Normalization Chapter 11 (Week 12) EE562 Slides and Modified Slides from Database Management Systems, R. Ramakrishnan 1 1NF FIRST S# Status City P# Qty S1 20 London P1 300 S1 20 London
More information4.8 Efficiency Experts A Solidify Understanding Task
4.8 Efficiency Experts A Solidify Understanding Task In our work so far, we have worked with linear and exponential equations in many forms. Some of the forms of equations and their names are: 2012 www.flickr.com/photos/cannongod
More information