Interactive Data Analysis. University at Buffalo, SUNY. Ying Yang

Outline
Convergent Inference Algorithms: enjoy the benefits of both exact and approximate inference algorithms
Future Work

PGM inference applications [figure: model]

Probabilistic graphical model (PGM) inference for missing values.

Raw input (HappyBuy):

    id      name                brand    category  ROWID
    P123    Apple 6s, White     NULL     phone     R1
    P124    Apple 5s, Black     NULL     phone     R2
    P125    Samsung Note2       Samsung  phone     R3
    P2345   Sony to inches      NULL     NULL      R4
    P34234  Dell, Intel 4 core  Dell     laptop    R5
    P34235  HP, AMD 2 core      HP       laptop    R6

PGM output (HappyBuy: SaneProduct):

    id      name                brand    category  ROWID
    P123    Apple 6s, White     Apple    phone     R1
    P124    Apple 5s, Black     Apple    phone     R2
    P125    Samsung Note2       Samsung  phone     R3
    P2345   Sony to inches      Sony     TV        R4
    P34234  Dell, Intel 4 core  Dell     laptop    R5
    P34235  HP, AMD 2 core      HP       laptop    R6

Probabilistic graphical model (PGM) inference for missing values: models for missing values and for schema matching. [figure: graphical models over variables D, I, G, S, J and A, B, C, E, F, H, L]

The domain size of each variable can be large, and as more models are composed together, the graph becomes larger and larger.

Exact or approximate inference?

"However, the conversion of a large graphical model into a junction tree is difficult and time-consuming. Researchers therefore often resort to approximate inference methods to deal with complex graphical models." -- Handbook of Pattern Recognition and Computer Vision

Convergent Inference Algorithms (CIA). CIAs are a family of algorithms that enjoy the benefits of both exact and approximate inference algorithms, by providing approximate results over the course of inference and eventually converging to an exact inference result.

Convergent Inference Algorithms. Goal: interactive analysis, choosing an algorithm in a principled way; enjoying the advantages of both exact and approximate inference; supporting a progress (loading) bar.

Convergent Inference Algorithms: Goals

Feasibility
1. After a fixed period of time t, a CIA can provide approximate results with (ε, δ) error bounds, such that P(|P_t - P_exact| < ε) > 1 - δ.
2. A CIA can obtain the exact result P_exact in a bounded time t_exact.

Efficiency
1. The time complexity required by a good CIA to obtain an exact result should be competitive with variable elimination.
2. The quality of the approximation produced by a good CIA should be competitive with approximate inference algorithms, given the same amount of time.

Cyclic Sampling

[figure: student network over D, I, G, S, J] Query: P(J)?

Gibbs Sampling

[figure: network over D, G, I, J, S]

    D  I  G  S  J  P(D,I,G,S,J)
    0  0  1  0  0  p1
    0  0  2  0  0  p2
    0  0  3  0  0  p3
    1  0  1  0  0  p4
    1  0  2  0  0  p5
    1  0  3  0  0  p6
    0  1  1  0  0  p7
    0  1  2  0  0  p8
    0  1  3  0  0  p9
    ...

Random sampling with replacement.

Cyclic Sampling

[figure: network over D, G, I, J, S]

    D  I  G  S  J  P(D,I,G,S,J)
    0  0  1  0  0  p1
    0  0  2  0  0  p2
    0  0  3  0  0  p3
    1  0  1  0  0  p4
    1  0  2  0  0  p5
    1  0  3  0  0  p6
    0  1  1  0  0  p7
    0  1  2  0  0  p8
    0  1  3  0  0  p9
    ...

Pseudorandom sampling without replacement.

Cyclic Sampling

CPTs for the running example:

    D  P(D)          I  P(I)
    0  0.6           0  0.7
    1  0.4           1  0.3

    G  I  D  P(G|I,D)        S  I  P(S|I)
    1  0  0  0.3             0  0  0.95
    2  0  0  0.4             1  0  0.05
    3  0  0  0.3             0  1  0.2
    1  0  1  0.05            1  1  0.8
    2  0  1  0.25
    3  0  1  0.7             J  G  S  P(J|G,S)
    1  1  0  0.9             1  1  0  0.9
    2  1  0  0.08            2  1  0  0.1
    3  1  0  0.02            1  2  0  0.8
    1  1  1  0.5             2  2  0  0.2
    2  1  1  0.3             1  3  0  0.6
    3  1  1  0.2             2  3  0  0.4
                             1  1  1  0.95
                             2  1  1  0.05
                             1  2  1  0.86
                             2  2  1  0.14
                             1  3  1  0.62
                             2  3  1  0.38

[figure: the joint table P(D,I,G,S,J) with rows p1 ... p9 from the previous slides]
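To make the example concrete, here is a minimal Python sketch (the dictionary encoding and helper names are mine, not from the talk) that encodes these CPTs, computes the exact marginal P(J) by brute-force enumeration, and runs the with-replacement baseline from the previous slide, simplified here to ancestral (forward) sampling rather than full Gibbs:

    import itertools, random

    P_D = {0: 0.6, 1: 0.4}                  # P(D)
    P_I = {0: 0.7, 1: 0.3}                  # P(I)
    P_G = {(1,0,0): 0.3,  (2,0,0): 0.4,  (3,0,0): 0.3,   # P(G|I,D), key (g,i,d)
           (1,0,1): 0.05, (2,0,1): 0.25, (3,0,1): 0.7,
           (1,1,0): 0.9,  (2,1,0): 0.08, (3,1,0): 0.02,
           (1,1,1): 0.5,  (2,1,1): 0.3,  (3,1,1): 0.2}
    P_S = {(0,0): 0.95, (1,0): 0.05,        # P(S|I), key (s,i)
           (0,1): 0.2,  (1,1): 0.8}
    P_J = {(1,1,0): 0.9,  (2,1,0): 0.1,     # P(J|G,S), key (j,g,s)
           (1,2,0): 0.8,  (2,2,0): 0.2,
           (1,3,0): 0.6,  (2,3,0): 0.4,
           (1,1,1): 0.95, (2,1,1): 0.05,
           (1,2,1): 0.86, (2,2,1): 0.14,
           (1,3,1): 0.62, (2,3,1): 0.38}

    def joint(d, i, g, s, j):
        # Chain rule for the network: D,I -> G; I -> S; G,S -> J.
        return P_D[d] * P_I[i] * P_G[(g,i,d)] * P_S[(s,i)] * P_J[(j,g,s)]

    # Exact P(J) by enumerating all 2*2*3*2*2 = 48 rows of the joint table.
    exact = {1: 0.0, 2: 0.0}
    for d, i, g, s, j in itertools.product((0,1), (0,1), (1,2,3), (0,1), (1,2)):
        exact[j] += joint(d, i, g, s, j)
    print("exact P(J):", exact)

    def draw(dist):
        # Draw one value from a {value: probability} table.
        r, acc = random.random(), 0.0
        for v, p in dist.items():
            acc += p
            if r < acc:
                return v
        return v  # guard against floating-point round-off

    # With-replacement baseline (previous slide): ancestral sampling.
    counts, n = {1: 0, 2: 0}, 10000
    for _ in range(n):
        d, i = draw(P_D), draw(P_I)
        g = draw({g: P_G[(g,i,d)] for g in (1,2,3)})
        s = draw({s: P_S[(s,i)] for s in (0,1)})
        counts[draw({j: P_J[(j,g,s)] for j in (1,2)})] += 1
    print("sampled P(J):", {j: c / n for j, c in counts.items()})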

Cyclic Sampling. Linear Congruential Generators: a cyclic pseudorandom number generator.
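What makes an LCG usable as a without-replacement sampler is the Hull-Dobell theorem: with suitable parameters, x_{k+1} = (a*x_k + c) mod M has period exactly M, i.e. it visits every integer in {0, ..., M-1} exactly once per cycle, in a scrambled order, with O(1) state. A minimal sketch (the parameters below are illustrative choices of mine for the 48-row joint table of the running example, not values from the talk):

    def lcg(modulus, a, c, seed=0):
        # Full-period LCG: x -> (a*x + c) % modulus. By the Hull-Dobell
        # theorem the period is exactly `modulus` iff (1) gcd(c, modulus) = 1,
        # (2) a - 1 is divisible by every prime factor of modulus, and
        # (3) a - 1 is divisible by 4 whenever modulus is.
        x = seed
        for _ in range(modulus):
            yield x
            x = (a * x + c) % modulus

    # The running example's joint table has 2*2*3*2*2 = 48 rows.
    # a = 13, c = 5 satisfy Hull-Dobell for modulus 48: a - 1 = 12 is
    # divisible by 2 and 3 (the prime factors of 48) and by 4; gcd(5, 48) = 1.
    order = list(lcg(48, 13, 5))
    assert sorted(order) == list(range(48))  # every row visited exactly once
    print(order[:10])                        # pseudorandom, non-repeating indices

Decoding an index back into an assignment (d, i, g, s, j) is a mixed-radix conversion, so summing joint(...) over indices in this order yields a running estimate of P(J) that becomes exact after 48 steps.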

Cyclic Sampling: Feasibility
1. After a fixed period of time t, a CIA can provide approximate results with (ε, δ) error bounds, such that P(|P_t - P_exact| < ε) > 1 - δ.
2. A CIA can obtain the exact result P_exact in a bounded time t_exact.

1. Confidence bound: [R. J. Serfling. Probability inequalities for the sum in sampling without replacement. The Annals of Statistics, 1974]
2. Convergence to the exact result: guaranteed by full-period Linear Congruential Generators.
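One standard form of Serfling's bound for the mean of n draws without replacement from a population of N values in [0, 1] is P(X̄_n - μ >= ε) <= exp(-2nε² / (1 - (n-1)/N)). Inverting it gives the half-width ε that cyclic sampling can report after seeing n of the N joint-table rows; a small helper (my own rephrasing of the 1974 bound, not code from the talk):

    import math

    def serfling_epsilon(n, N, delta):
        # Half-width eps such that the sample mean of n of N values in [0, 1],
        # drawn without replacement, is within eps of the true mean with
        # probability >= 1 - delta (one-sided Serfling bound).
        f = (n - 1) / N                      # without-replacement correction
        return math.sqrt((1 - f) * math.log(1 / delta) / (2 * n))

    # The bound tightens as n approaches N; once n = N, cyclic sampling has
    # enumerated every row and simply reports the exact answer.
    for n in (10, 24, 48):
        print(n, serfling_epsilon(n, 48, delta=0.05))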

Cyclic Sampling: Efficiency
1. The time complexity required by a good CIA to obtain an exact result should be competitive with variable elimination.
2. The quality of the approximation produced by a good CIA should be competitive with approximate inference algorithms, given the same amount of time.

Computation cost: for a graph with N binary variables, the running time is O(2^N).

Variable Elimination / Belief Propagation

The running time of Variable Elimination is determined by Ψ_max, the largest intermediate factor; Belief Propagation costs 2·Ψ_max.

[figure: factors over D, I, G]

    Ψ1(D,G,I) = Φ_D[D] · Φ_G[D,G,I]
    τ1(I,G)   = Σ_D Ψ1(D,G,I)
    Ψ2(G,I)   = τ1(I,G) · Φ_I[I]
    τ2(G)     = Σ_I Ψ2(G,I)
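In code, each elimination step is a factor product followed by a sum over the eliminated variable. A minimal sketch of exactly the two steps above, using the P(D), P(I), and P(G|I,D) tables from the CPT slide (the dictionary-based factor representation is mine):

    from itertools import product

    P_D = {0: 0.6, 1: 0.4}
    P_I = {0: 0.7, 1: 0.3}
    P_G = {(1,0,0): 0.3,  (2,0,0): 0.4,  (3,0,0): 0.3,   # P(G|I,D), key (g,i,d)
           (1,0,1): 0.05, (2,0,1): 0.25, (3,0,1): 0.7,
           (1,1,0): 0.9,  (2,1,0): 0.08, (3,1,0): 0.02,
           (1,1,1): 0.5,  (2,1,1): 0.3,  (3,1,1): 0.2}

    # Psi_1(D, G, I) = Phi_D[D] * Phi_G[D, G, I]
    psi1 = {(d, g, i): P_D[d] * P_G[(g, i, d)]
            for d, g, i in product((0, 1), (1, 2, 3), (0, 1))}

    # tau_1(I, G) = sum_D Psi_1(D, G, I)        -- eliminates D
    tau1 = {(i, g): sum(psi1[(d, g, i)] for d in (0, 1))
            for i, g in product((0, 1), (1, 2, 3))}

    # Psi_2(G, I) = tau_1(I, G) * Phi_I[I];  tau_2(G) = sum_I Psi_2(G, I)
    tau2 = {g: sum(tau1[(i, g)] * P_I[i] for i in (0, 1)) for g in (1, 2, 3)}
    print(tau2)  # the marginal P(G); its three entries sum to 1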

Leaky Joins. Each pass generates a batch of samples.

Leaky Joins. Analogous to belief propagation, but each pass sends partial messages.
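The algorithm itself is specified in the paper; the sketch below is only a heavily simplified illustration of the partial-message idea (the batch size, the fixed enumeration order, and the restriction to a single message τ1 are all my simplifications). Instead of fully materializing τ1(I,G) = Σ_D Ψ1(D,G,I) before using it, each pass leaks one batch of Ψ1 terms into the accumulating message and re-derives the downstream marginal from the partial message, so the estimate improves every pass and is exact once all terms have leaked through:

    from itertools import product

    P_D = {0: 0.6, 1: 0.4}
    P_I = {0: 0.7, 1: 0.3}
    P_G = {(1,0,0): 0.3,  (2,0,0): 0.4,  (3,0,0): 0.3,   # P(G|I,D), key (g,i,d)
           (1,0,1): 0.05, (2,0,1): 0.25, (3,0,1): 0.7,
           (1,1,0): 0.9,  (2,1,0): 0.08, (3,1,0): 0.02,
           (1,1,1): 0.5,  (2,1,1): 0.3,  (3,1,1): 0.2}

    # All terms of Psi_1(D, G, I); the real algorithm would enumerate these
    # in LCG (pseudorandom, without-replacement) order rather than sorted order.
    terms = [((i, g), P_D[d] * P_G[(g, i, d)])
             for d, g, i in product((0, 1), (1, 2, 3), (0, 1))]

    tau1 = {(i, g): 0.0 for i, g in product((0, 1), (1, 2, 3))}  # partial message
    done, BATCH = 0, 4
    while done < len(terms):
        for key, val in terms[done:done + BATCH]:   # leak one batch of terms
            tau1[key] += val
        done += BATCH
        # Re-derive the downstream marginal from the partial message;
        # normalizing absorbs the not-yet-leaked probability mass.
        est = {g: sum(tau1[(i, g)] * P_I[i] for i in (0, 1)) for g in (1, 2, 3)}
        z = sum(est.values())
        print(done, {g: round(est[g] / z, 4) for g in (1, 2, 3)})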

Leaky Joins: Feasibility
1. After a fixed period of time t, a CIA can provide approximate results with (ε, δ) error bounds, such that P(|P_t - P_exact| < ε) > 1 - δ.
2. A CIA can obtain the exact result P_exact in a bounded time t_exact.

Confidence bound: the running estimate is within ε_cum of the exact answer with probability at least 1 - δ.

Leaky Joins: Efficiency
1. The time complexity required by a good Convergent Inference Algorithm (CIA) to obtain an exact result should be competitive with variable elimination.
2. The quality of the approximation produced by a good CIA should be competitive with approximate inference algorithms, given the same amount of time.

Exact time: the same as Variable Elimination, up to a constant factor.

Experiment: Data Sets

[figure: Extended Student network with nodes Coherence, Difficulty, Intelligence, Grade, SAT, Letter, Job, Happy]

Experiment: Approximate Inference Accuracy

[figure: accuracy results on the Insurance network]

Experiment: Approximate Inference Accuracy

[figure: accuracy results on the Barley network]

Experiment: Convergence Time

[figure: convergence time on the Extended Student network (Coherence, Difficulty, Intelligence, Grade, SAT, Letter, Job, Happy)]

Outline
Convergent Inference Algorithms: enjoy the benefits of both exact and approximate inference algorithms
Future Work

Future Work (overview; each item is detailed on the following slides)
1. Dynamic Bayesian Networks
2. In-Database / In-SparkSQL
3. Distributed CIA

Dynamic Bayesian Networks - Future Work
Motivations: modeling sequential data is important in many areas of science and engineering: NLP (handwriting, speech, human action and video recognition), time series analysis, bio-sequence analysis.
Feasibility: such models have special graph characteristics, but CIAs can still be applied. E.g., the HMM forward-backward inference algorithm is essentially belief propagation.
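As a concrete instance of the "forward-backward is belief propagation" point, the forward recursion α_t(z) = P(x_t | z) · Σ_{z'} α_{t-1}(z') P(z | z') is exactly a message passed along the chain. A minimal sketch with a hypothetical two-state HMM (all numbers below are illustrative, not from the talk):

    # Hypothetical 2-state HMM: prior pi, transition A[z'][z], emission B[z][x].
    pi = [0.6, 0.4]
    A  = [[0.7, 0.3],
          [0.2, 0.8]]
    B  = [[0.9, 0.1],   # P(x | z=0)
          [0.3, 0.7]]   # P(x | z=1)

    def forward(obs):
        # Forward messages alpha_t(z) = P(x_1..x_t, z_t = z).
        alpha = [pi[z] * B[z][obs[0]] for z in (0, 1)]
        for x in obs[1:]:
            alpha = [B[z][x] * sum(alpha[zp] * A[zp][z] for zp in (0, 1))
                     for z in (0, 1)]
        return alpha    # sum(alpha) is the likelihood P(x_1..x_T)

    print(forward([0, 1, 1, 0]))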

In-Database / In-SparkSQL - Future Work
Motivations: avoid moving data outside of the database, for both I/O and privacy reasons.
Challenges: (1) ensuring pseudorandom sampling using LCGs; (2) generating a graph clustering plan.
Possible solutions: (1) metadata; (2) UDF query processing plans.

Distributed CIA - Future Work
Motivations: inference is CPU-intensive computation.
Possible solutions: techniques like general distributed query processing or parallel probabilistic query processing.

Future Work: New Pricing Plan
Motivations: data analytics systems are available as cloud services, e.g. Amazon EMR and Windows Azure HDInsight.

Future Work: New Pricing Plan
Motivations: current pricing plans charge an hourly rate for every instance-hour used; the rate depends on the instance type (e.g. standard, high CPU, high memory, high storage, etc.).
Drawback: it is difficult for an analyst to choose a plan for a specific analysis job.

Future Work: New Pricing Plan
On-demand interactive pricing plan. [figure: the analyst updates data, or refers to data in Amazon S3, e.g. "update ABC dataset"]

Future Work: New Pricing Plan
On-demand interactive pricing plan. [figure: the query P(J)? over the student network (D, I, G, S, J), annotated with its run time and price]

Future Work: Related Work
- Ortiz, Jennifer, Brendan Lee, and Magdalena Balazinska. "PerfEnforce Demonstration: Data Analytics with Performance Guarantees." Proceedings of the 2016 International Conference on Management of Data. ACM, 2016.
- Windows Azure
- Amazon EMR
While existing work focuses on general SQL-like data analysis, we focus on inference over PGMs and dynamic BNs (HMMs, CRFs).

Future Work: New Pricing Plan
On-demand interactive pricing plan. [figure: the query P(J)? over the student network with a progress loading bar (30% ... 100%), reporting P(J=0)=0.4 and P(J=1)=0.6 with an (ε, δ) error bound]

Conclusions
1. Perform pseudorandom sampling without replacement using Linear Congruential Generators.
2. Propose a new online aggregation algorithm called Leaky Joins that produces samples of a query's result in the course of normally evaluating the query.
3. Provide an analysis of time complexity and confidence bounds for CIAs.
4. Generalize the algorithms to arbitrary aggregate queries over small but dense tables.