Interactive Data Analysis
University at Buffalo, SUNY
Ying Yang

Outline
1. Convergent Inference Algorithm: enjoy the benefits of both exact and approximate inference algorithms
2. Future Work

PGM inference application model

Probabilistic graphical model (PGM): inference for missing values

Raw input, HappyBuy: Product
id | name | brand | category | ROWID
P123 | Apple 6s, White | NULL | phone | R1
P124 | Apple 5s, Black | NULL | phone | R2
P125 | Samsung Note2 | Samsung | phone | R3
P2345 | Sony to inches | NULL | NULL | R4
P34234 | Dell, Intel 4 core | Dell | laptop | R5
P34235 | HP, AMD 2 core | HP | laptop | R6

PGM output, HappyBuy: SaneProduct
id | name | brand | category | ROWID
P123 | Apple 6s, White | Apple | phone | R1
P124 | Apple 5s, Black | Apple | phone | R2
P125 | Samsung Note2 | Samsung | phone | R3
P2345 | Sony to inches | Sony | TV | R4
P34234 | Dell, Intel 4 core | Dell | laptop | R5
P34235 | HP, AMD 2 core | HP | laptop | R6

Probabilistic graphical model (PGM): inference for missing values
[Figure: two example graphs, "Model for missing values and schema matching" and "Model for missing values"]
The domain size of each variable can be large.
As more models are composed together, the graph grows larger and larger.

Exact or approximate inference?
"However, the conversion of a large graphical model into a junction tree is difficult and time-consuming. Researchers therefore often resort to approximate inference methods to deal with complex graphical models."
(Handbook of Pattern Recognition and Computer Vision)

Convergent Inference Algorithms (CIA)
CIAs are a set of algorithms that enjoy the benefits of both exact and approximate inference: they provide approximate results over the course of inference and eventually converge to the exact inference result.

Convergent Inference Algorithms
Goal: interactive analysis, with the algorithm chosen in a principled way. Enjoy the advantages of both exact and approximate inference.
[Figure: progress loading bar]

Convergent Inference Algorithms
Goal:
Feasibility
1. After a fixed period of time t, a CIA can provide approximate results with ε-δ error bounds, such that $P(|P_t - P_{exact}| < \epsilon) > 1 - \delta$.
2. A CIA can obtain the exact result $P_{exact}$ in a bounded time $t_{exact}$.
Efficiency
1. The time complexity required by a good CIA to obtain an exact result should be competitive with variable elimination.
2. The quality of the approximation produced by a good CIA should be competitive with an approximate inference algorithm given the same amount of time.

Cyclic Sampling
[Figure: Bayesian network over D, I, G, S, J; query: P(J)?]

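Gibbs Sampling
[Figure: Bayesian network over D, G, I, J, S]
D | I | G | S | J | P(D,I,G,S,J)
0 | 0 | 1 | 0 | 0 | p1
0 | 0 | 2 | 0 | 0 | p2
0 | 0 | 3 | 0 | 0 | p3
1 | 0 | 1 | 0 | 0 | p4
1 | 0 | 2 | 0 | 0 | p5
1 | 0 | 3 | 0 | 0 | p6
0 | 1 | 1 | 0 | 0 | p7
0 | 1 | 2 | 0 | 0 | p8
0 | 1 | 3 | 0 | 0 | p9
Random sampling with replacement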

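Cyclic Sampling
[Figure: Bayesian network over D, G, I, J, S]
D | I | G | S | J | P(D,I,G,S,J)
0 | 0 | 1 | 0 | 0 | p1
0 | 0 | 2 | 0 | 0 | p2
0 | 0 | 3 | 0 | 0 | p3
1 | 0 | 1 | 0 | 0 | p4
1 | 0 | 2 | 0 | 0 | p5
1 | 0 | 3 | 0 | 0 | p6
0 | 1 | 1 | 0 | 0 | p7
0 | 1 | 2 | 0 | 0 | p8
0 | 1 | 3 | 0 | 0 | p9
Pseudorandom sampling without replacement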
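The contrast between the two slides (sampling rows with replacement versus without replacement) can be illustrated with a minimal sketch. It only shows the coverage difference; it is not an implementation of Gibbs sampling, and `random.sample` stands in here for the cyclic generator. The table size `N` and the seed are hypothetical values chosen for illustration.

```python
import random

N = 48  # hypothetical number of rows in the joint table
rng = random.Random(0)  # fixed seed so the run is repeatable

# With replacement: the same row can be drawn many times, so N draws
# usually leave part of the table unvisited.
with_replacement = [rng.randrange(N) for _ in range(N)]

# Without replacement: a permutation of the row indices visits every row
# exactly once, so after N draws the whole table has been seen.
without_replacement = rng.sample(range(N), N)

coverage_with = len(set(with_replacement))
coverage_without = len(set(without_replacement))
print(f"rows covered with replacement:    {coverage_with} of {N}")
print(f"rows covered without replacement: {coverage_without} of {N}")
```

Because every row is visited exactly once, an aggregate accumulated over a without-replacement pass is exact once the pass completes.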

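Cyclic Sampling
[Figure: Bayesian network over D, I, G, S, J with its conditional probability tables]

D | P(D)
0 | 0.6
1 | 0.4

I | P(I)
0 | 0.7
1 | 0.3

G | I | D | P(G | I,D)
1 | 0 | 0 | 0.3
2 | 0 | 0 | 0.4
3 | 0 | 0 | 0.3
1 | 0 | 1 | 0.05
2 | 0 | 1 | 0.25
3 | 0 | 1 | 0.7
1 | 1 | 0 | 0.9
2 | 1 | 0 | 0.08
3 | 1 | 0 | 0.02
1 | 1 | 1 | 0.5
2 | 1 | 1 | 0.3
3 | 1 | 1 | 0.2

S | I | P(S | I)
0 | 0 | 0.95
1 | 0 | 0.05
0 | 1 | 0.2
1 | 1 | 0.8

J | G | S | P(J | G,S)
1 | 1 | 0 | 0.9
2 | 1 | 0 | 0.1
1 | 2 | 0 | 0.8
2 | 2 | 0 | 0.2
1 | 3 | 0 | 0.6
2 | 3 | 0 | 0.4
1 | 1 | 1 | 0.95
2 | 1 | 1 | 0.05
1 | 2 | 1 | 0.86
2 | 2 | 1 | 0.14
1 | 3 | 1 | 0.62
2 | 3 | 1 | 0.38

D | I | G | S | J | P(D,I,G,S,J)
0 | 0 | 1 | 0 | 0 | p1
0 | 0 | 2 | 0 | 1 | p2
0 | 0 | 3 | 0 | 0 | p3
1 | 0 | 1 | 0 | 0 | p4
1 | 0 | 2 | 0 | 0 | p5
1 | 0 | 3 | 0 | 1 | p6
0 | 1 | 1 | 0 | 0 | p7
0 | 1 | 2 | 0 | 0 | p8
0 | 1 | 3 | 0 | 0 | p9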

Cyclic Sampling
Linear congruential generators (LCGs): cyclic pseudorandom number generators
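As a minimal sketch of why an LCG is "cyclic": the recurrence x ← (a·x + c) mod m visits every value in [0, m) exactly once per period when the Hull-Dobell conditions hold. The function names and the parameters m = 16, a = 5, c = 3 are illustrative assumptions, not values taken from the slides.

```python
from math import gcd

def prime_factors(n):
    """Return the set of prime factors of n (trial division)."""
    factors, p = set(), 2
    while p * p <= n:
        while n % p == 0:
            factors.add(p)
            n //= p
        p += 1
    if n > 1:
        factors.add(n)
    return factors

def has_full_period(m, a, c):
    """Check the Hull-Dobell conditions for x -> (a*x + c) mod m."""
    if gcd(c, m) != 1:
        return False                      # c and m must be coprime
    if any((a - 1) % p for p in prime_factors(m)):
        return False                      # a-1 divisible by each prime factor of m
    if m % 4 == 0 and (a - 1) % 4 != 0:
        return False                      # a-1 divisible by 4 when m is
    return True

def lcg_cycle(m, a, c, seed=0):
    """Yield m successive LCG states starting from seed."""
    x = seed % m
    for _ in range(m):
        yield x
        x = (a * x + c) % m

order = list(lcg_cycle(16, 5, 3, seed=7))
print(order)
```

With full-period parameters, the generator acts as a pseudorandom permutation of the row indices that needs only the current state as memory, which is what makes sampling without replacement cheap.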

Cyclic Sampling
Feasibility
1. After a fixed period of time t, a CIA can provide approximate results with ε-δ error bounds, such that $P(|P_t - P_{exact}| < \epsilon) > 1 - \delta$.
2. A CIA can obtain the exact result $P_{exact}$ in a bounded time $t_{exact}$.
How cyclic sampling meets these goals:
1. Confidence bound: [R. J. Serfling. Probability inequalities for the sum in sampling without replacement. The Annals of Statistics, 1974]
2. Convergence to the exact result: linear congruential generators
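To make the ε-δ bound concrete, the sketch below evaluates the commonly stated form of Serfling's inequality for the mean of n draws without replacement from N bounded values, and compares it with Hoeffding's bound for sampling with replacement. The function names are hypothetical, the two-sided factor of 2 comes from a union bound, and the slides do not say exactly how the bound is instantiated for CIAs, so treat this as an illustration only.

```python
import math

def serfling_delta(n, N, eps, value_range=1.0):
    """Upper bound on P(|sample mean - true mean| >= eps) after n of N
    draws without replacement, for values spanning value_range."""
    if not 1 <= n <= N:
        raise ValueError("n must satisfy 1 <= n <= N")
    f = (n - 1) / N  # sampling fraction used in Serfling's inequality
    one_sided = math.exp(-2.0 * n * eps ** 2 / ((1.0 - f) * value_range ** 2))
    return min(1.0, 2.0 * one_sided)

def hoeffding_delta(n, eps, value_range=1.0):
    """Same bound for sampling with replacement (Hoeffding's inequality)."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * eps ** 2 / value_range ** 2))

for n in (100, 500, 900):
    print(n, serfling_delta(n, 1000, 0.05), hoeffding_delta(n, 0.05))
```

The factor 1 - (n - 1)/N shrinks as the sample approaches the full table, so the without-replacement bound tightens faster than the with-replacement one.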

Cyclic Sampling
Efficiency
1. The time complexity required by a good CIA to obtain an exact result should be competitive with variable elimination.
2. The quality of the approximation produced by a good CIA should be competitive with an approximate inference algorithm given the same amount of time.
Computation cost: for a graph with N binary variables, the running time is $O(2^N)$.
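A minimal sketch of cyclic sampling on the student example follows: an LCG walks the 48 rows of the joint table in pseudorandom order, a running estimate of P(J=1) is available after every row, and the estimate is exact after one full period. The scaling estimator (partial mass times N/n), the LCG parameters a = 13 and c = 7, and all function names are assumptions made for illustration; the slides do not specify them.

```python
from itertools import product

P_D = {0: 0.6, 1: 0.4}
P_I = {0: 0.7, 1: 0.3}
P_G = {  # (I, D) -> {G: P(G | I, D)}
    (0, 0): {1: 0.3, 2: 0.4, 3: 0.3},
    (0, 1): {1: 0.05, 2: 0.25, 3: 0.7},
    (1, 0): {1: 0.9, 2: 0.08, 3: 0.02},
    (1, 1): {1: 0.5, 2: 0.3, 3: 0.2},
}
P_S = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.2, 1: 0.8}}  # I -> {S: P(S | I)}
P_J1 = {  # (G, S) -> P(J=1 | G, S)
    (1, 0): 0.9, (2, 0): 0.8, (3, 0): 0.6,
    (1, 1): 0.95, (2, 1): 0.86, (3, 1): 0.62,
}
DOMAINS = [(0, 1), (0, 1), (1, 2, 3), (0, 1), (1, 2)]  # D, I, G, S, J
N_STATES = 48  # 2 * 2 * 3 * 2 * 2 rows in the joint table

def joint(d, i, g, s, j):
    """P(D, I, G, S, J) as the product of the five CPT entries."""
    pj1 = P_J1[(g, s)]
    pj = pj1 if j == 1 else 1.0 - pj1
    return P_D[d] * P_I[i] * P_G[(i, d)][g] * P_S[i][s] * pj

def decode(index):
    """Mixed-radix decoding of a row index into (D, I, G, S, J)."""
    values = []
    for dom in reversed(DOMAINS):
        index, r = divmod(index, len(dom))
        values.append(dom[r])
    return tuple(reversed(values))

def cyclic_estimates(a=13, c=7, seed=5):
    """Yield (n, estimate of P(J=1)) after each of the 48 distinct rows."""
    x, mass = seed % N_STATES, 0.0
    for n in range(1, N_STATES + 1):
        row = decode(x)
        if row[4] == 1:
            mass += joint(*row)
        # Scale the partial sum up to the full table; early values are rough
        # and can exceed 1, but at n == N_STATES the scale factor is 1.
        yield n, mass * N_STATES / n
        x = (a * x + c) % N_STATES  # full-period LCG for modulus 48

estimates = list(cyclic_estimates())
exact = sum(joint(*row) for row in product(*DOMAINS) if row[4] == 1)
print(estimates[11][1], estimates[-1][1], exact)
```

The loop touches every row of the joint table once, which is the exponential cost noted above: the table has one row per joint assignment.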

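Variable Elimination / Belief Propagation
The running time of variable elimination is determined by the largest intermediate factor, $\Psi_{max}$; for belief propagation it is $2 \cdot \Psi_{max}$.
[Figure: network fragment over D, I, G]
$\Psi_1(D,G,I) = \Phi_D[D] \, \Phi_G[D,G,I]$
$\tau_1(I,G) = \sum_D \Psi_1(D,G,I)$
$\Psi_2(G,I) = \tau_1(I,G) \, \Phi_I[I]$
$\tau_2(G) = \sum_I \Psi_2(G,I)$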
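The four elimination steps above can be traced numerically with the P(D), P(I), and P(G | I,D) tables from the student example. This is a plain-dictionary sketch for illustration; the variable names mirror the symbols on the slide, and `phi_g` is a hypothetical helper.

```python
D_VALS, I_VALS, G_VALS = (0, 1), (0, 1), (1, 2, 3)

PHI_D = {0: 0.6, 1: 0.4}          # P(D)
PHI_I = {0: 0.7, 1: 0.3}          # P(I)
G_GIVEN_ID = {                    # (I, D) -> (P(G=1), P(G=2), P(G=3))
    (0, 0): (0.3, 0.4, 0.3),
    (0, 1): (0.05, 0.25, 0.7),
    (1, 0): (0.9, 0.08, 0.02),
    (1, 1): (0.5, 0.3, 0.2),
}

def phi_g(d, g, i):
    """Factor Phi_G[D, G, I] = P(G | I, D)."""
    return G_GIVEN_ID[(i, d)][g - 1]

# Psi_1(D, G, I) = Phi_D[D] * Phi_G[D, G, I]
psi1 = {(d, g, i): PHI_D[d] * phi_g(d, g, i)
        for d in D_VALS for g in G_VALS for i in I_VALS}

# tau_1(I, G) = sum over D of Psi_1(D, G, I)
tau1 = {(i, g): sum(psi1[(d, g, i)] for d in D_VALS)
        for i in I_VALS for g in G_VALS}

# Psi_2(G, I) = tau_1(I, G) * Phi_I[I]
psi2 = {(g, i): tau1[(i, g)] * PHI_I[i] for g in G_VALS for i in I_VALS}

# tau_2(G) = sum over I of Psi_2(G, I), which is the marginal P(G)
tau2 = {g: sum(psi2[(g, i)] for i in I_VALS) for g in G_VALS}

print(tau2)
print("largest intermediate factor:", max(len(psi1), len(psi2)), "entries")
```

The largest table built along the way is Psi_1 with 12 entries, which is the quantity that dominates the cost of this elimination order.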

Leaky Joins
At each step, generate a batch of samples.

Leaky Joins
Analogous to belief propagation, but only partial messages are sent at each step.

Leaky Joins
Feasibility
1. After a fixed period of time t, a CIA can provide approximate results with ε-δ error bounds, such that $P(|P_t - P_{exact}| < \epsilon) > 1 - \delta$.
2. A CIA can obtain the exact result $P_{exact}$ in a bounded time $t_{exact}$.
Confidence bound: the error is bounded by $\epsilon_{cum}$ with probability: [formula on slide]

Leaky Joins
Efficiency
1. The time complexity required by a good Convergent Inference Algorithm (CIA) to obtain an exact result should be competitive with variable elimination.
2. The quality of the approximation produced by a good CIA should be competitive with an approximate inference algorithm given the same amount of time.
Time to the exact result: the same as variable elimination, up to a constant factor.

Experiment: data sets
[Figure: Bayesian network with nodes Coherence, Difficulty, Intelligence, Grade, SAT, Letter, Job, Happy]

Experiment: approximate inference accuracy
[Figure: results on the Insurance network]

Experiment: approximate inference accuracy
[Figure: results on the Barley network]

Experiment: convergence time
[Figure: Extended Student network with nodes Coherence, Difficulty, Intelligence, Grade, SAT, Letter, Job, Happy]

Outline
1. Convergent Inference Algorithm: enjoy the benefits of both exact and approximate inference algorithms
2. Future Work

Future Work

Dynamic Bayesian Network
Motivations: Modeling sequential data is important in many areas of science and engineering: NLP (handwriting, speech, human action, and video recognition), time series analysis, and bio-sequence analysis.
Feasibility: These graphs have special characteristics, but CIA can still be applied. E.g., an HMM's forward-backward inference algorithm is essentially belief propagation.

In-Database/In-SparkSQL
Motivations: Avoid moving data outside the database, for I/O and privacy reasons.
Challenges: (1) ensuring pseudorandom sampling using an LCG; (2) generating a graph clustering plan.
Possible solutions: (1) metadata; (2) a UDF query processing plan.

Distributed CIA
Motivations: Inference is a CPU-intensive computation.
Possible solutions: techniques such as general distributed query processing or parallel probabilistic query processing.
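The remark that HMM forward-backward inference is essentially belief propagation can be seen in the forward pass, which sends one sum-product message per time step along the chain. The sketch below uses a hypothetical two-state HMM (all probabilities are made up for illustration) and checks the result against brute-force enumeration of hidden paths.

```python
from itertools import product

# Hypothetical 2-state, 2-symbol HMM parameters (illustrative only).
INIT = [0.6, 0.4]                    # P(z_1)
TRANS = [[0.7, 0.3], [0.4, 0.6]]     # TRANS[p][s] = P(z_t = s | z_{t-1} = p)
EMIT = [[0.9, 0.1], [0.2, 0.8]]      # EMIT[s][o] = P(x_t = o | z_t = s)

def forward_likelihood(obs):
    """P(obs) via the forward recursion; obs must be non-empty."""
    alpha = [INIT[s] * EMIT[s][obs[0]] for s in range(2)]
    for o in obs[1:]:
        # Message to step t: sum out the previous state, then weight by
        # the emission probability of the current observation.
        alpha = [sum(alpha[p] * TRANS[p][s] for p in range(2)) * EMIT[s][o]
                 for s in range(2)]
    return sum(alpha)

def brute_force_likelihood(obs):
    """P(obs) by summing the joint over every hidden state path."""
    total = 0.0
    for path in product(range(2), repeat=len(obs)):
        p = INIT[path[0]] * EMIT[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= TRANS[path[t - 1]][path[t]] * EMIT[path[t]][obs[t]]
        total += p
    return total

observations = [0, 1, 1, 0]
print(forward_likelihood(observations), brute_force_likelihood(observations))
```

The forward pass costs time linear in the sequence length, while enumeration grows exponentially, which is the same gap that separates message passing from scanning the full joint table in the static case.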

Future Work: New Pricing Plan
Motivations: Data analytics systems are available as cloud services, e.g. Amazon EMR and Windows Azure HDInsight.

Future Work: New Pricing Plan
Motivations: Current pricing plans charge an hourly rate for every instance hour used. The rate depends on the instance type (e.g. standard, high CPU, high memory, high storage).
Drawbacks: It is difficult for an analyst to choose an instance type for a specific analysis job.

Future Work: New Pricing Plan
On-demand interactive pricing plan
[Figure: interface mock-up with the options "Update data", "Or refer to data in Amazon S3", and "Update ABC dataset"]

Future Work: New Pricing Plan
On-demand interactive pricing plan
[Figure: Bayesian network over D, I, G, S, J with the query P(J)?, shown with run time and price]

Future Work: Related Work
- Ortiz, Jennifer, Brendan Lee, and Magdalena Balazinska. "PerfEnforce Demonstration: Data Analytics with Performance Guarantees." Proceedings of the 2016 International Conference on Management of Data. ACM, 2016.
- Windows Azure
- Amazon EMR
While existing work focuses on general SQL-like data analysis, we focus on inference over PGMs and dynamic Bayesian networks (HMM, CRF).

Future Work: New Pricing Plan
On-demand interactive pricing plan
[Figure: Bayesian network over D, I, G, S, J with the query P(J)? and a progress loading bar marked at 30% and 100%]
P(J=0) = 0.4, P(J=1) = 0.6, error bound: [ε, δ]

Conclusions
1. Performed pseudorandom sampling without replacement using linear congruential generators.
2. Proposed a new online aggregation algorithm, Leaky Joins, that produces samples of a query's result in the course of normally evaluating the query.
3. Provided an analysis of time complexity and confidence bounds for CIAs.
4. Generalized the algorithms to any aggregate queries over small but dense tables.