Knowledge Extraction from DBNs for Images

Knowledge Extraction from DBNs for Images. Son N. Tran and Artur d'Avila Garcez, Department of Computer Science, City University London

Contents
1 Introduction
2 Knowledge Extraction from DBNs
3 Experimental Results on Images
4 Conclusion and Future Work

Motivation
Deep networks have shown good performance in image, audio, video and multimodal learning. We would like to know why, by studying the role of symbolic reasoning in DBNs. In particular, we would like to find out:
- How knowledge is represented in deep architectures
- The relations between deep networks and a hierarchy of rules
- How knowledge can be transferred to analogous domains

Restricted Boltzmann Machine
- Two-layer symmetric connectionist system [Smolensky, 1986]
- Represents a joint distribution $P(V, H)$
- Given training data, learning by Contrastive Divergence (CD) seeks to maximize $P(V) = \sum_H P(V, H)$
- It can be used to approximate the data distribution given new data (rather like an associative memory)

Restricted Boltzmann Machine (details)
- Generative model that can be trained to maximize the log-likelihood $\mathcal{L}(\theta \mid D) = \log \prod_{x \in D} P(v = x)$, where $\theta$ is the set of parameters (weights and biases) and $D$ is a training set of size $n$
- $P(v = x) = \frac{1}{Z} \sum_h \exp(-E(v, h))$, where $E$ is the energy of the network model
- This log-likelihood is intractable since it is not easy to compute the partition function $Z = \sum_{v, h} \exp(-E(v, h))$
- But it can be approximated efficiently using CD [Hinton, 2002]: $\Delta w_{ij} = \frac{1}{n} \sum_n \langle v_i h_j \rangle_{\text{step 0}} - \frac{1}{n} \sum_n \langle v_i h_j \rangle_{\text{step 1}}$
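To make the CD update concrete, here is a minimal NumPy sketch of one CD-1 step for a binary RBM. It is an illustrative sketch, not the authors' code: the names (`cd1_update`, `W`, `vb`, `hb`, `lr`) and the mini-batch formulation are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, vb, hb, batch, lr=0.1, rng=None):
    """One CD-1 update for a binary RBM.

    W:     (num_visible, num_hidden) weight matrix
    vb:    (num_visible,) visible biases
    hb:    (num_hidden,)  hidden biases
    batch: (n, num_visible) training mini-batch
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = batch.shape[0]
    # Positive phase: <v_i h_j> at step 0 (data-driven hidden probabilities).
    h0_prob = sigmoid(batch @ W + hb)
    pos = batch.T @ h0_prob / n
    # One Gibbs step: sample hiddens, reconstruct visibles, recompute hiddens.
    h0_sample = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    v1_prob = sigmoid(h0_sample @ W.T + vb)
    h1_prob = sigmoid(v1_prob @ W + hb)
    # Negative phase: <v_i h_j> at step 1 (reconstruction-driven).
    neg = v1_prob.T @ h1_prob / n
    # Parameter updates approximating the log-likelihood gradient.
    W += lr * (pos - neg)
    vb += lr * (batch - v1_prob).mean(axis=0)
    hb += lr * (h0_prob - h1_prob).mean(axis=0)
    return W, vb, hb
```

Using probabilities rather than samples for the negative statistics is a common variance-reduction choice; the formula on the slide simply averages the pairwise statistics over the $n$ training cases.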

Deep Belief Networks
- Deep Belief Networks [Hinton et al., 2006]: a stack of RBMs
- Greedily learns each pair of layers bottom-up with CD
- Fine-tuning option 1: split each weight matrix into up and down weights (wake-sleep algorithm)
- Fine-tuning option 2: use the stack as a feedforward neural network and update the weights with backpropagation
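A minimal sketch of the greedy layer-wise procedure, reusing `sigmoid` and `cd1_update` from the RBM sketch above; passing each RBM's hidden probabilities upward as the next layer's training data follows the recipe of Hinton et al. (2006), but `train_dbn` and its parameters are illustrative assumptions.

```python
import numpy as np

def train_dbn(data, layer_sizes, epochs=10, lr=0.1, seed=0):
    """Greedily train a stack of RBMs bottom-up; returns per-layer (W, vb, hb)."""
    rng = np.random.default_rng(seed)
    layers, x = [], data
    for n_hidden in layer_sizes:
        n_visible = x.shape[1]
        W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        vb, hb = np.zeros(n_visible), np.zeros(n_hidden)
        for _ in range(epochs):
            # cd1_update() as defined in the previous sketch.
            W, vb, hb = cd1_update(W, vb, hb, x, lr=lr, rng=rng)
        layers.append((W, vb, hb))
        # Hidden probabilities of this RBM become the "data" for the next layer.
        x = sigmoid(x @ W + hb)
    return layers
```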

Deep Belief Networks (example)
- The lower-level layer is expected to capture low-level features
- Higher-level layers combine features to learn progressively more abstract concepts
- A label can be attached to the top RBM for classification
Figure: example DBN for digit images, with a class layer (digits 0 to 9) on top of a second hidden layer (shapes) and a first hidden layer (edges).

Rule Extraction from RBMs: related work
- [Pinkas, 1995]: rule extraction from symmetric networks using penalty logic; proved an equivalence between conjunctive normal form and energy functions
- [Penning et al., 2011]: extraction of temporal logic rules from RTRBMs using sampling; extracts rules of the form $\text{hypothesis}^t \leftarrow \text{belief}_1 \wedge \ldots \wedge \text{belief}_n \wedge \text{hypothesis}^{t-1}$
- [Son Tran and Garcez, 2012]: rule extraction using confidence-values, similar to penalty logic but maintaining implicational form; extraction without sampling

Rule Extraction from RBMs (cont.)
- Both penalties [Pinkas, 1995] and confidence-values [Penning et al., 2011, Son Tran and Garcez, 2012] represent the reliability of a rule
- Inference with penalty logic amounts to optimizing a ranking function, and is therefore similar to weighted SAT
- In [Penning et al., 2011] the confidence-value is not used for inference, whereas the confidence-values extracted by our method can be used for hierarchical inference

Our method: partial-model extraction
Extracts rules of the form $c_j : h_j \leftarrow \bigwedge_{w_{pj} > 0} v_p \wedge \bigwedge_{w_{nj} < 0} \neg v_n$ with $c_j = \sum_{w_{ij} > 0} w_{ij} - \sum_{w_{ij} < 0} w_{ij}$ (i.e. the sum of the absolute values of the weights); the same applies to visible units $v_i$.
Example:
$15 : h_0 \leftarrow v_1 \wedge v_2 \wedge v_3$
$7 : h_1 \leftarrow v_1 \wedge v_2 \wedge v_3$
These rules are called partial-model because they capture only partially the architecture and behavior of the network.
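A sketch of the partial-model extraction step, under the assumption that `W` stores visible units as rows and hidden units as columns; the string encoding of the rules and the function name are illustrative, not the paper's notation.

```python
import numpy as np

def extract_partial_rules(W, visible_names):
    """One partial-model rule per hidden unit h_j: positive weights give
    positive literals, negative weights give negated literals, and the
    confidence c_j is the sum of the absolute weights into h_j."""
    rules = []
    for j in range(W.shape[1]):
        col = W[:, j]
        c_j = float(np.abs(col).sum())
        body = [name if w > 0 else f"not {name}"
                for name, w in zip(visible_names, col) if w != 0]
        rules.append((c_j, f"h{j} <- " + " and ".join(body)))
    return rules
```

For the two example hidden units above, weight magnitudes [5, 3, 7] and [2, 4, 1] would give confidences 15 and 7 respectively.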

Our method: complete-model extraction
Confidence-vector for $h_j$: $[\,|w_{1j}|, |w_{2j}|, \ldots\,]$
Complete rules: $c_j : h_j \xleftarrow{[\,|w_{1j}|, |w_{2j}|, \ldots\,]} \bigwedge_{w_{ij} > 0} v_i \wedge \bigwedge_{w_{ij} < 0} \neg v_i$
Example:
$15 : h_0 \xleftarrow{[5,3,7]} v_1 \wedge v_2 \wedge v_3$
$7 : h_1 \xleftarrow{[2,4,1]} v_1 \wedge v_2 \wedge v_3$
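The complete-model variant keeps the per-literal weight magnitudes rather than collapsing them into a single number. Again a sketch under the same row/column assumption, with an ad hoc dictionary encoding of the rules.

```python
import numpy as np

def extract_complete_rules(W, visible_names):
    """One complete-model rule per hidden unit h_j, carrying the
    confidence-vector [|w_1j|, |w_2j|, ...] alongside the body literals."""
    rules = []
    for j in range(W.shape[1]):
        col = W[:, j]
        rules.append({
            "head": f"h{j}",
            "confidence": float(np.abs(col).sum()),     # c_j, as in the partial model
            "confidence_vector": np.abs(col).tolist(),  # per-literal weights
            "body": [name if w > 0 else f"not {name}"
                     for name, w in zip(visible_names, col) if w != 0],
        })
    return rules
```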

Inference
Given a rule $c : h \xleftarrow{[w_1, w_2, \ldots, w_n]} b_1 \wedge b_2 \wedge \ldots \wedge b_n$ and beliefs $\alpha_1 : b_1, \alpha_2 : b_2, \ldots, \alpha_n : b_n$, we infer $c_h : h$, where $c_h = f(c \cdot (w_1 \alpha_1 + w_2 \alpha_2 + \ldots + w_n \alpha_n))$.
- $\alpha_i : b_i$ means that $b_i$ is believed to hold with confidence $\alpha_i$
- $f$ is a monotonically nondecreasing function; we use either a sign-based function ($f(x) = 1$ if $x > 0$, otherwise $f(x) = 0$) or the logistic function; $f$ normalizes the confidence value to $[0, 1]$
- $c$ is the confidence of the rule; $c_h$ is the inferred confidence of $h$
- In partial models, $w_i = c/n$. The inference is deterministic (but stochastic inference is possible)
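A sketch of the inference step as written on this slide, assuming the beliefs $\alpha_i$ and the confidence-vector entries are non-negative; the handling of negated body literals is left out, and the function names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sign_based(x):
    return 1.0 if x > 0 else 0.0

def infer_head_confidence(c, weights, beliefs, f=sigmoid):
    """Compute c_h = f(c * (w_1*alpha_1 + ... + w_n*alpha_n)) for one rule.

    c:       confidence of the rule
    weights: confidence-vector [w_1, ..., w_n] of the rule body
    beliefs: [alpha_1, ..., alpha_n], confidence in each body literal
    """
    w = np.asarray(weights, dtype=float)
    a = np.asarray(beliefs, dtype=float)
    return float(f(c * (w @ a)))
```

For a partial-model rule the same call applies with every $w_i$ set to $c/n$. For example, `infer_head_confidence(15, [5, 3, 7], [1.0, 0.0, 1.0])` saturates close to 1 under the logistic $f$.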

Partial-model vs. Complete-model
- Partial model: equalizes the weights; can help generalization and is good if the weights are similar, but loses information otherwise
- Complete model: behaves very much like the network, but the rules are difficult to visualize; used as a baseline
Example:
$2 : h_0 \leftarrow v_1 \wedge v_2$
$2 : h_1 \leftarrow v_1 \wedge v_2$
Both rules have the same confidence-value, but the first is a better match to $h_0$ than the second is to $h_1$.

XOR problem
Truth table:
X Y Z
0 0 0
0 1 1
1 0 1
1 1 0
Trained RBM parameters:
W =
10.0600  3.9304  9.8485
 9.6408  9.5271  7.5398
 5.0645  9.9315  9.8054
visb = [4.5196  4.3642  4.5371]
Extracted rules:
$25 : h_0 \leftarrow x \wedge y \wedge z$
$23 : h_1 \leftarrow x \wedge y \wedge z$
$27 : h_2 \leftarrow x \wedge y \wedge z$
$13 : x \wedge y \wedge z$
If z is the ground truth, then the combined, normalized rule is:
$0.999 : z \leftrightarrow (x \wedge \neg y) \vee (\neg x \wedge y)$
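The rounded confidence values on this slide can be recovered from the printed weight magnitudes; the short check below assumes the columns of W correspond to the hidden units $h_0, h_1, h_2$ (the sign information of the weights did not survive transcription, so only the confidences are checked).

```python
import numpy as np

# Weight magnitudes and visible biases as printed above.
W = np.array([[10.0600, 3.9304, 9.8485],
              [ 9.6408, 9.5271, 7.5398],
              [ 5.0645, 9.9315, 9.8054]])
visb = np.array([4.5196, 4.3642, 4.5371])

# Per-hidden-unit confidences: column sums of |W| -> approx. 24.8, 23.4, 27.2,
# which round to the 25, 23 and 27 reported for h_0, h_1, h_2.
print(np.abs(W).sum(axis=0))
# Confidence of the visible-unit rule: approx. 13.4, which rounds to 13.
print(np.abs(visb).sum())
```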

Logical inference vs. Stochastic inference
- A DBN with 784-500-500-2000 nodes (plus 10 label nodes) was trained on the MNIST handwritten digits dataset
- The figure shows the result of downward inference from the labels using the network (top) and using its complete model with a sigmoid function $f$ for logical inference (bottom)
- To reconstruct the images from the labels using the network, we run up-down inference several times; to reconstruct the images from the rules, Gibbs sampling is not used and we go downwards once through the rules

System pruning
One can use rule extraction to prune the network by removing hidden units corresponding to rules with low confidence-values.
Figure: reconstruction of images from the pruned RBM with (a) 500, (b) 382, (c) 212 and (d) 145 hidden units.
Figure: classification by an SVM using features from the pruned RBMs.
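A sketch of confidence-based pruning: rank the hidden units by the confidence of their extracted rules and keep only the top ones. The `keep` parameterization is an assumption for illustration; the slide does not specify how the cut-off was chosen.

```python
import numpy as np

def prune_by_confidence(W, hb, keep):
    """Keep the `keep` hidden units whose extracted rules have the highest
    confidence (sum of absolute incoming weights); drop the rest."""
    confidence = np.abs(W).sum(axis=0)
    kept = np.argsort(confidence)[::-1][:keep]
    return W[:, kept], hb[kept], kept
```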

Transfer Learning
Problems in machine learning:
- Data in the problem domain is limited
- Data in the problem domain is difficult to label
- Prior knowledge in the problem domain is hard to obtain
Solution: learn knowledge from unlabelled data in related domains, which are largely available, and transfer that knowledge to the problem domain.

Transferring Knowledge to Learn
Source domain: MNIST handwritten digits
Target domains: ICDAR (digit recognition), TiCC (writer recognition)
Figure: sample images from (a) the MNIST dataset, (b) the ICDAR dataset and (c) the TiCC dataset.

Experimental Results
Source : Target   SVM     RBM     PM Transfer   CM Transfer
MNIST : ICDAR     68.50   65.50   66.50         66.50
                  38.14   50.00   50.51         51.55
MNIST : TiCC      72.94   78.82   79.41         81.18
                  73.44   80.23   83.05         80.79
Figure: TiCC average accuracy vs. size of transferred knowledge.

Conclusion and Future Work
- New knowledge extraction method for deep networks
- Initial results on image datasets and transfer learning
Future work:
- More results and analysis of the rules' applicability to transfer learning (is it domain dependent?)
- Extraction of partial models that approximate the network well (midway between the complete and the current partial model)
- The best way of generalizing and revising rules after transferring them (knowledge insertion to close the learning cycle)

References I
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800.
Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554.
Penning, L. d., Garcez, A. S. d., Lamb, L. C., and Meyer, J.-J. C. (2011). A neural-symbolic cognitive agent for online learning and reasoning. In IJCAI, pages 1653-1658.

References II
Pinkas, G. (1995). Reasoning, nonmonotonicity and learning in connectionist networks that capture propositional knowledge. Artificial Intelligence, 77(2):203-247.
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In Rumelhart, D. E. and McClelland, J. L., editors, Parallel Distributed Processing: Volume 1: Foundations, pages 194-281. MIT Press, Cambridge.
Son Tran and Garcez, A. (2012). Logic extraction from deep belief networks. In ICML 2012 Representation Learning Workshop, Edinburgh.