Probabilistic and Logistic Circuits: A New Synthesis of Logic and Machine Learning Guy Van den Broeck HRL/ACTIONS @ KR Oct 28, 2018

Foundation: Logical Circuit Languages

Negation Normal Form Circuits Δ = (sun ∧ rain ⇒ rainbow) [Darwiche 2002]

Decomposable Circuits [Darwiche 2002] — AND gates have disjoint input circuits (their inputs share no variables).

Tractable for Logical Inference
Is there a solution? (SAT)
SAT(α ∨ β) iff SAT(α) or SAT(β) (always)
SAT(α ∧ β) iff SAT(α) and SAT(β) (decomposable)
How many solutions are there? (#SAT)
Complexity linear in circuit size

Deterministic Circuits [Darwiche 2002] — OR gates have at most one true input wire on any complete input.

How many solutions are there? (#SAT)

How many solutions are there? (#SAT) Arithmetic Circuit

Tractable for Logical Inference
Is there a solution? (SAT)
How many solutions are there? (#SAT)
Stricter languages (e.g., BDD, SDD) additionally support: equivalence checking; conjoining/disjoining/negating circuits
Complexity linear in circuit size
Compilation into the circuit language either by an exhaustive SAT solver or by conjoin/disjoin/negate operations
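
To make the tractability concrete, here is a small Python sketch (my illustration, not code from the talk): for a circuit that is smooth, decomposable, and deterministic, the model count is one bottom-up pass that treats literals as 1, OR gates as +, and AND gates as ×.

```python
# Minimal sketch: model counting (#SAT) by evaluating a smooth, decomposable,
# deterministic NNF circuit as an arithmetic circuit.

def count_models(node):
    kind = node[0]
    if kind == "lit":                 # ("lit", "A") or ("lit", "~A")
        return 1
    if kind == "and":                 # decomposable: children share no variables
        result = 1
        for child in node[1:]:
            result *= count_models(child)
        return result
    if kind == "or":                  # deterministic: children are mutually exclusive
        return sum(count_models(child) for child in node[1:])
    raise ValueError(f"unknown node type: {kind}")

# Circuit for (A or B), written in smooth, deterministic, decomposable form:
#   (A and (B or ~B))  or  (~A and B)
circuit = ("or",
           ("and", ("lit", "A"), ("or", ("lit", "B"), ("lit", "~B"))),
           ("and", ("lit", "~A"), ("lit", "B")))

print(count_models(circuit))          # -> 3 models: {A,B}, {A,~B}, {~A,B}
```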

Learning with Logical Constraints

Motivation: Video [Lu, W. L., Ting, J. A., Little, J. J., & Murphy, K. P. (2013). Learning to track and identify players from broadcast sports videos.]

Motivation: Robotics [Wong, L. L., Kaelbling, L. P., & Lozano-Perez, T., Collision-free state estimation. ICRA 2012]

Motivation: Language Non-local dependencies: at least one verb in each sentence. Sentence compression: if a modifier is kept, its subject is also kept. Information extraction, semantic role labeling, and many more! [Chang, M., Ratinov, L., & Roth, D. (2008). Constraints as prior knowledge], [Chang, M. W., Ratinov, L., & Roth, D. (2012). Structured learning with constrained conditional models.], [https://en.wikipedia.org/wiki/constrained_conditional_model]

Motivation: Deep Learning [Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., et al.. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626), 471-476.]

Running Example Courses: Logic (L), Knowledge Representation (K), Probability (P), Artificial Intelligence (A). Data. Constraints: Must take at least one of Probability or Logic. Probability is a prerequisite for AI. The prerequisite for KR is either AI or Logic.

Structured Space
Must take at least one of Probability (P) or Logic (L). Probability is a prerequisite for AI (A). The prerequisite for KR (K) is either AI or Logic.
Unstructured space: all 16 instantiations of L, K, P, A.
Structured space: 7 out of 16 instantiations are impossible; the 9 that satisfy the constraints are:
L K P A
0 0 1 0
0 0 1 1
0 1 1 1
1 0 0 0
1 0 1 0
1 0 1 1
1 1 0 0
1 1 1 0
1 1 1 1
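
A quick way to verify that count (a sketch I added, not part of the talk): enumerate all assignments to L, K, P, A and filter by the three constraints.

```python
from itertools import product

# Enumerate the running example: which of the 16 instantiations of
# L, K, P, A satisfy the course constraints?
valid = [(L, K, P, A)
         for L, K, P, A in product([0, 1], repeat=4)
         if (P or L)              # at least one of Probability or Logic
         and (not A or P)         # Probability is a prerequisite for AI
         and (not K or A or L)]   # the prerequisite for KR is AI or Logic

print(len(valid))                 # -> 9 valid; the other 7 are impossible
```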

Boolean Constraints
P ∨ L (must take at least one of Probability or Logic)
A ⇒ P (Probability is a prerequisite for AI)
K ⇒ A ∨ L (the prerequisite for KR is either AI or Logic)
These constraints rule out 7 of the 16 instantiations of L, K, P, A, leaving the structured space of 9 valid instantiations.

Learning in Structured Spaces Data + Constraints (Background Knowledge) (Physics) Learn ML Model Today's machine learning tools don't take knowledge as input!

Deep Learning with Logical Constraints

Deep Learning with Logical Knowledge Data + Constraints Learn Deep Neural Network: Input → Neural Network → Output → Logical Constraint. Output is a probability vector p, not Boolean logic!

Semantic Loss Q: How close is output p to satisfying the constraint? Answer: semantic loss function L(α,p). Axioms, for example: If p is Boolean then L(p,p) = 0. If α implies β then L(α,p) ≥ L(β,p) (α is more strict). Properties: If α is equivalent to β then L(α,p) = L(β,p). If p is Boolean and satisfies α then L(α,p) = 0. SEMANTIC loss!

Semantic Loss: Definition
Theorem: the axioms imply a unique semantic loss (up to a constant factor):
L(α, p) ∝ -log Σ_{x ⊨ α} Π_{i : x ⊨ X_i} p_i · Π_{i : x ⊨ ¬X_i} (1 - p_i)
The inner product is the probability of getting state x after flipping coins with probabilities p; the sum over x ⊨ α is the probability of satisfying α after flipping coins with probabilities p.

Example: Exactly-One
Data must have some label. We agree this must be one of the 10 digits: an exactly-one constraint.
For 3 classes the constraint is (x1 ∨ x2 ∨ x3) ∧ (¬x1 ∨ ¬x2) ∧ (¬x2 ∨ ¬x3) ∧ (¬x1 ∨ ¬x3).
Semantic loss: L(exactly-one, p) = -log Σ_i p_i · Π_{j≠i} (1 - p_j)
Each term is the probability that only x_i = 1 after flipping coins; the sum is the probability that exactly one x is true after flipping coins.

Semi-Supervised Learning Intuition: unlabeled data must have some label. Minimize the exactly-one semantic loss on unlabeled data. Train with existing loss + w · semantic loss.
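
A minimal plain-Python sketch of the exactly-one semantic loss from the previous slides and of how it is added to the training objective; the weight w and the example probabilities are illustrative, not values from the talk.

```python
import math

def exactly_one_semantic_loss(p):
    """-log of the probability that flipping independent coins with
    probabilities p yields exactly one true variable."""
    prob = 0.0
    for i, p_i in enumerate(p):
        term = p_i
        for j, p_j in enumerate(p):
            if j != i:
                term *= 1.0 - p_j
        prob += term
    return -math.log(prob)

# Unlabeled example where the network hedges between two classes:
p = [0.45, 0.45, 0.10]
print(exactly_one_semantic_loss(p))

# Training objective on unlabeled data (w is a tuning weight):
# total_loss = existing_loss + w * exactly_one_semantic_loss(p)
```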

MNIST Experiment Competitive with state of the art in semi-supervised deep learning

FASHION Experiment Outperforms Ladder Nets! Same conclusion on CIFAR10

What about real constraints? Paths (cf. the Nature paper). Good variable assignments (represent a route): 184. Bad variable assignments (do not represent a route): 16,777,032. Unstructured probability space: 184 + 16,777,032 = 2^24. The space is easily encoded in logical constraints. [Nishino et al.]

How to Compute Semantic Loss? In general: #P-hard. With a logical circuit for α: linear! Example, for the exactly-one constraint over 3 classes: L(α, p) = -log( p1(1-p2)(1-p3) + (1-p1)p2(1-p3) + (1-p1)(1-p2)p3 ). Why? Decomposability and determinism!
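
Why decomposability and determinism make this linear: the same bottom-up pass used for model counting computes the probability of satisfying α when literals are mapped to p_i or 1 - p_i. A hedged sketch (my illustration, not the talk's implementation); on the exactly-one constraint it reproduces the closed form above.

```python
import math

def weighted_model_count(node, p):
    kind = node[0]
    if kind == "lit":                 # ("lit", i, True) is x_i, False is ~x_i
        _, i, positive = node
        return p[i] if positive else 1.0 - p[i]
    if kind == "and":                 # decomposability makes the product valid
        result = 1.0
        for child in node[1:]:
            result *= weighted_model_count(child, p)
        return result
    if kind == "or":                  # determinism makes the sum valid
        return sum(weighted_model_count(child, p) for child in node[1:])
    raise ValueError(kind)

def semantic_loss(circuit, p):
    return -math.log(weighted_model_count(circuit, p))

# Exactly-one over 3 variables as a deterministic, decomposable circuit (DNF
# of mutually exclusive terms): (x0 ~x1 ~x2) or (~x0 x1 ~x2) or (~x0 ~x1 x2)
exactly_one = ("or",
    ("and", ("lit", 0, True),  ("lit", 1, False), ("lit", 2, False)),
    ("and", ("lit", 0, False), ("lit", 1, True),  ("lit", 2, False)),
    ("and", ("lit", 0, False), ("lit", 1, False), ("lit", 2, True)))

print(semantic_loss(exactly_one, [0.45, 0.45, 0.10]))
```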

Predict Shortest Paths Add semantic loss for path constraint Is prediction the shortest path? This is the real task! Are individual edge predictions correct? Is output a path? (same conclusion for predicting sushi preferences, see paper)

Probabilistic Circuits

Logical Circuits [circuit diagram over the variables L, K, P, A and their negations] Can we represent a distribution over the solutions to the constraint?

Recall: Decomposability [circuit diagram] AND gates have disjoint input circuits.

Recall: Determinism [circuit diagram evaluated on one assignment] Input: L, K, P, A are true (so ¬L, ¬K, ¬P, ¬A are false). Property: OR gates have at most one true input wire.

PSDD: Probabilistic SDD [circuit diagram with parameters 0.1/0.6/0.3, 0.6/0.4, 0.8/0.2, 0.25/0.75, and 0.9/0.1 on the OR-gate input wires] Syntax: assign a normalized probability to each OR-gate input.

PSDD: Probabilistic SDD [same circuit, evaluated on the input where L, K, P, A are true] Pr(L,K,P,A) = 0.3 × 1 × 0.8 × 0.4 × 0.25 = 0.024

Each node represents a normalized distribution! [circuit diagram] Can read probabilistic independencies off the circuit structure.

Tractable for Probabilistic Inference MAP inference: find the most likely assignment to x given y (otherwise NP-hard). Computing conditional probabilities Pr(x | y) (otherwise #P-hard). Sampling from Pr(x | y). Algorithms are linear in circuit size (pass up, pass down, similar to backprop).
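
To illustrate the upward pass behind Pr(x | y), here is a sketch on a tiny hypothetical two-variable PSDD (not the circuit from the talk): leaves contradicting the evidence evaluate to 0, unobserved leaves to 1, AND gates multiply, OR gates take a parameter-weighted sum, and a conditional probability is a ratio of two such passes.

```python
def evaluate(node, evidence):
    kind = node[0]
    if kind == "lit":                       # ("lit", "Rain", True)
        _, var, positive = node
        if var not in evidence:
            return 1.0                      # variable is marginalized out
        return 1.0 if evidence[var] == positive else 0.0
    if kind == "and":
        result = 1.0
        for child in node[1:]:
            result *= evaluate(child, evidence)
        return result
    if kind == "or":                        # ("or", (theta, child), ...)
        return sum(theta * evaluate(child, evidence) for theta, child in node[1:])
    raise ValueError(kind)

# A tiny hypothetical PSDD over Rain and Umbrella:
rainy = ("or", (0.9, ("lit", "Umbrella", True)), (0.1, ("lit", "Umbrella", False)))
dry   = ("or", (0.3, ("lit", "Umbrella", True)), (0.7, ("lit", "Umbrella", False)))
psdd  = ("or",
         (0.2, ("and", ("lit", "Rain", True),  rainy)),
         (0.8, ("and", ("lit", "Rain", False), dry)))

pr_umbrella = evaluate(psdd, {"Umbrella": True})                         # 0.42
pr_rain_and_umbrella = evaluate(psdd, {"Rain": True, "Umbrella": True})  # 0.18
print(pr_rain_and_umbrella / pr_umbrella)   # Pr(Rain | Umbrella) ≈ 0.43
```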

Parameters are Interpretable [annotated circuit diagram: one set of parameters is the probability of course P given L; leaf-level parameters are the probability that a student takes course P, or course L] Explainable AI DARPA Program

Learning Probabilistic Circuit Parameters

Learning Algorithms Closed-form maximum likelihood from complete data. One pass over the data to estimate Pr(x | y). Not a lot to say: very easy! Where does the structure come from? For now: simply compiled from the constraint.
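
That one pass amounts to counting and normalizing. A minimal hedged sketch for a single OR gate of a deterministic circuit, assuming we already know which input wire each complete training example makes hot (in a real PSDD this comes from evaluating the circuit on the example); the data and smoothing constant are illustrative.

```python
from collections import Counter

def estimate_or_parameters(hot_wire_per_example, num_wires, alpha=1.0):
    """Closed-form maximum likelihood (with Laplace smoothing alpha) for one
    OR gate: count which input wire is the unique hot one in each example
    that reaches the gate, then normalize."""
    counts = Counter(hot_wire_per_example)
    total = len(hot_wire_per_example) + alpha * num_wires
    return [(counts[i] + alpha) / total for i in range(num_wires)]

# Hypothetical data: the hot wire observed for one 3-input decision gate
# across 10 complete training examples.
print(estimate_or_parameters([0, 2, 1, 0, 0, 2, 0, 1, 0, 0], num_wires=3))
```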

Combinatorial Objects: Rankings
rank  sushi (ranking 1)   sushi (ranking 2)
1     fatty tuna          shrimp
2     sea urchin          sea urchin
3     salmon roe          salmon roe
4     shrimp              fatty tuna
5     tuna                tuna
6     squid               squid
7     tuna roll           tuna roll
8     sea eel             sea eel
9     egg                 egg
10    cucumber roll       cucumber roll
10 items: 3,628,800 rankings. 20 items: 2,432,902,008,176,640,000 rankings.

Combinatorial Objects: Rankings (continued)
Predict Boolean variables A_ij: item i is at position j.
Constraints: each item i is assigned to a unique position (n constraints); each position j is assigned a unique item (n constraints).
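
A small sketch of how these requirements can be spelled out as clauses over the A_ij variables (my encoding, with hypothetical literal tuples): each "unique position" or "unique item" requirement is an exactly-one constraint, expanded here into at-least-one and at-most-one clauses.

```python
from itertools import combinations

def ranking_constraints(n):
    """CNF clauses over Boolean variables A[i][j] ("item i is at position j"):
    every item gets exactly one position and every position gets exactly one
    item. A literal is a tuple (i, j, polarity)."""
    clauses = []
    for i in range(n):                          # item i in exactly one position
        clauses.append([(i, j, True) for j in range(n)])      # at least one
        clauses += [[(i, j1, False), (i, j2, False)]          # at most one
                    for j1, j2 in combinations(range(n), 2)]
    for j in range(n):                          # position j holds exactly one item
        clauses.append([(i, j, True) for i in range(n)])
        clauses += [[(i1, j, False), (i2, j, False)]
                    for i1, i2 in combinations(range(n), 2)]
    return clauses

print(len(ranking_constraints(10)))             # clause count for 10 items
```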

Learning Preference Distributions Baseline: a special-purpose distribution, Mixture-of-Mallows (number of components from 1 to 20, EM with 10 random seeds, implementation of Lu & Boutilier). PSDD: the circuit structure does not even depend on the data!

Learning Probabilistic Circuit Structure

Structure Learning Primitive

Structure Learning Primitive Primitives maintain PSDD properties and constraint of root!

LearnPSDD Algorithm
1. (Vtree learning)*
2. Construct the most naïve PSDD
3. LearnPSDD (search for better structure): generate candidate operations, simulate operations, execute the best operation
Works with or without a logical constraint.
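
A schematic sketch of that search loop (my abstraction, not the talk's code); the candidate generator and scoring function are stand-ins for the split/clone operations and held-out likelihood used by LearnPSDD.

```python
def learn_structure(initial_circuit, generate_candidates, simulate_score, iterations):
    """Greedy structure search: generate candidates, simulate, execute the best."""
    circuit = initial_circuit
    for _ in range(iterations):
        candidates = generate_candidates(circuit)     # e.g. split/clone operations
        if not candidates:
            break
        best_op = max(candidates, key=lambda op: simulate_score(circuit, op))
        circuit = best_op(circuit)                    # execute the best operation
    return circuit

# Toy usage with integers standing in for circuits and a made-up score:
ops = lambda c: [lambda c=c, d=d: c + d for d in (-1, +1)]
score = lambda c, op: -abs(op(c) - 7)
print(learn_structure(0, ops, score, iterations=7))   # -> 7
```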

PSDDs are Sum-Product Networks are Arithmetic Circuits: a PSDD decision (OR) node with input wires (p_1, s_1), ..., (p_n, s_n) and parameters θ_1, ..., θ_n corresponds to the arithmetic-circuit expression θ_1·p_1·s_1 + θ_2·p_2·s_2 + ... + θ_n·p_n·s_n.

Experiments on 20 datasets Compare with O-SPN: smaller size in 14, better LL in 11, win on both in 6 Compare with L-SPN: smaller size in 14, better LL in 6, win on both in 2 Compared to SPN learners, LearnPSDD gives comparable performance yet smaller size

Learn Mixtures of PSDDs
Q: Help! I need to learn a discrete probability distribution.
A: Learn a mixture of PSDDs!
Strongly outperforms Bayesian network learners and Markov network learners. Competitive with SPN learners and cutset network learners. State of the art on 6 datasets!

Logistic Circuits

What if I only want to classify Y? Pr(Y, A, B, C, D)

Logistic Circuits Represents Pr(Y | A, B, C, D): take all hot wires, sum their weights, push through the logistic function.

Logistic vs. Probabilistic Circuits Probabilities become log-odds: Pr(Y | A, B, C, D) instead of Pr(Y, A, B, C, D).

Parameter Learning Reduce to logistic regression: features associated with each wire ("global circuit flow" features). Learning the parameters θ is convex optimization!
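
A hedged sketch of that reduction, assuming we already have the flow features (one binary indicator per wire, computed by evaluating the circuit on each example; the feature matrix below is made up): fitting the per-wire weights is then ordinary convex logistic regression.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical circuit-flow features: one row per example, one column per
# wire, 1 if that wire is hot for the example. In the real algorithm these
# come from evaluating the logistic circuit.
flow_features = np.array([
    [1, 0, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 1, 0],
])
labels = np.array([1, 1, 0, 0])

model = LogisticRegression().fit(flow_features, labels)
wire_weights = model.coef_[0]          # one learned weight per wire

# Prediction = logistic(sum of weights of hot wires + bias), as on the slide.
print(model.predict_proba(flow_features)[:, 1])
```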

Logistic Circuit Structure Learning Similar to LearnPSDD structure learning Generate candidate operations Calculate Gradient Variance Execute the best operation

Comparable Accuracy with Neural Nets

Significantly Smaller in Size Better Data Efficiency

Reasoning with Probabilistic Circuits

Compilation target for probabilistic reasoning: Bayesian networks, factor graphs, probabilistic programs, Markov Logic, relational Bayesian networks, and probabilistic databases all compile into probabilistic circuits.

Compilation for Prob. Inference

Collapsed Compilation
To sample a circuit:
1. Compile bottom-up until you reach the size limit
2. Pick a variable you want to sample
3. Sample it according to its marginal distribution in the current circuit
4. Condition on the sampled value
5. (Repeat)
Asymptotically unbiased importance sampler

Circuits + importance weights approximate any query

Experiments Competitive with state-of-the-art approximate inference in graphical models. Outperforms it on several benchmarks!

Reasoning About Classifiers

Classifier Trimming [diagram: classifier α over features F1, F2, F3, F4 with threshold T on class node C, and its trimming, classifier β over F2, F3 with the same threshold] Trim features while maintaining classification behavior.

How to Measure Similarity? Expected classification agreement: what is the expected probability that a classifier α will agree with its trimming β?
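
One way to write this expectation explicitly (my notation; the robust-trimming paper cited in the references gives the formal definition): the expected agreement of α and its trimming β under the feature distribution Pr is Σ_f Pr(f) · [α(f) = β(f)], i.e., the total probability mass of feature instantiations f on which the two classifiers make the same decision.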

Solving PP^PP problems with constrained SDDs [circuit diagrams over the variables L, K, P, A, constraining a function f]

SDD method faster than traditional jointree inference

Classification agreement and accuracy Higher agreement tends to get higher accuracy Additional dimension for feature selection

Conclusions [diagram: Statistical ML (probability), Symbolic AI (logic), and Connectionism (deep learning) meet in circuits]

Questions? PSDD with 15,000 nodes

References
Doga Kisa, Guy Van den Broeck, Arthur Choi and Adnan Darwiche. Probabilistic Sentential Decision Diagrams. In Proceedings of the 14th International Conference on Principles of Knowledge Representation and Reasoning (KR), 2014.
Arthur Choi, Guy Van den Broeck and Adnan Darwiche. Tractable Learning for Structured Probability Spaces: A Case Study in Learning Preference Distributions. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), 2015.
Arthur Choi, Guy Van den Broeck and Adnan Darwiche. Probability Distributions over Structured Spaces. In Proceedings of the AAAI Spring Symposium on KRR, 2015.
Jessa Bekker, Jesse Davis, Arthur Choi, Adnan Darwiche and Guy Van den Broeck. Tractable Learning for Complex Probability Queries. In Advances in Neural Information Processing Systems 28 (NIPS), 2015.
YooJung Choi, Adnan Darwiche and Guy Van den Broeck. Optimal Feature Selection for Decision Robustness in Bayesian Networks. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), 2017.

References (continued)
Yitao Liang, Jessa Bekker and Guy Van den Broeck. Learning the Structure of Probabilistic Sentential Decision Diagrams. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI), 2017.
Yitao Liang and Guy Van den Broeck. Towards Compact Interpretable Models: Shrinking of Learned Probabilistic Sentential Decision Diagrams. In IJCAI 2017 Workshop on Explainable Artificial Intelligence (XAI), 2017.
YooJung Choi and Guy Van den Broeck. On Robust Trimming of Bayesian Network Classifiers. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), 2018.
Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang and Guy Van den Broeck. A Semantic Loss Function for Deep Learning with Symbolic Knowledge. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.
Yitao Liang and Guy Van den Broeck. Learning Logistic Circuits. In Proceedings of the UAI 2018 Workshop: Uncertainty in Deep Learning, 2018.
Tal Friedman and Guy Van den Broeck. Approximate Knowledge Compilation by Online Collapsed Importance Sampling. In Advances in Neural Information Processing Systems 31 (NIPS), 2018.