Towards sub-quadratic learning for mixtures of trees (Vers un apprentissage subquadratique pour les mélanges d'arbres)

Towards sub-quadratic learning for mixtures of trees. F. Schnitzler (1), P. Leray (2), L. Wehenkel (1). fschnitzler@ulg.ac.be. (1) Université de Liège, (2) Université de Nantes. JFRB 2010, 10 May 2010.

The goal of this research is to improve the learning of Bayesian networks in high-dimensional problems. This has great potential in many applications: bioinformatics, power networks.

Outline: 1. Motivation, 2. Algorithms, 3. Experiments, 4. Conclusion.

The choice of the structure search space is a compromise.
- The set of all Bayesian networks: able to model any density, but a super-exponential number of structures, difficult structure learning, overfitting, and difficult inference.
- Sets of simpler structures: reduced modeling power, but learning and inference are potentially easier.
A tree is a graph without cycles in which each variable has at most one parent.

Mixtures of trees combine qualities of Bayesian networks and trees. A forest is a tree with missing edges. A mixture of trees is an ensemble method:

P_MT(x) = Σ_{i=1}^{m} w_i P_{T_i}(x)
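To make the ensemble view concrete, here is a minimal Python sketch (our own naming, not the authors' code) that evaluates the mixture density, assuming each tree density is available as a callable and the weights sum to 1:

```python
# Minimal sketch: evaluate P_MT(x) = sum_i w_i * P_Ti(x) for one assignment x.
def mixture_density(x, tree_densities, weights):
    return sum(w * p(x) for w, p in zip(weights, tree_densities))

# Toy usage with two placeholder "tree" densities over 3 binary variables:
p1 = lambda x: 0.5 ** len(x)   # stand-in for a tree-factorized distribution
p2 = lambda x: 0.5 ** len(x)
print(mixture_density((0, 1, 1), [p1, p2], [0.3, 0.7]))   # 0.125
```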

Mixtures of trees combine qualities of Bayesian networks and trees. Several models: large modeling power. Simple models: low complexity (inference is linear; for learning, most algorithms are quadratic). Quadratic complexity could be too high for very large problems; in this work, we try to decrease it.
Reference: Learning with Mixtures of Trees, M. Meila & M. I. Jordan, JMLR 2001.

Quadratic scaling is due to the Chow-Liu algorithm, which maximizes the data likelihood and is composed of 2 steps:
- construction of a complete graph whose edge weights are the empirical mutual informations (O(n^2 N)),
- computation of the maximum weight spanning tree (O(n^2 log n)).
Reference: Approximating Discrete Probability Distributions with Dependence Trees, C. Chow & C. Liu, IEEE Trans. Inf. Theory, 1968.
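As an illustration, here is a minimal Python sketch of these two steps for binary data (function names such as empirical_mi and chow_liu_tree are ours, not the authors'); running scipy's minimum spanning tree on the negated weights yields a maximum weight spanning tree:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def empirical_mi(xi, xj):
    """Empirical mutual information between two binary columns."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((xi == a) & (xj == b))
            p_a, p_b = np.mean(xi == a), np.mean(xj == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(data):
    """Step 1: O(n^2 N) pairwise MI; step 2: maximum weight spanning tree."""
    n = data.shape[1]
    w = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            w[i, j] = empirical_mi(data[:, i], data[:, j])
    # Minimizing the negated MI maximizes the total MI of the tree.
    mst = minimum_spanning_tree(-w)
    return list(zip(*mst.nonzero()))

data = np.random.randint(0, 2, size=(500, 6))   # N = 500 samples, n = 6 variables
print(chow_liu_tree(data))
```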

We propose to consider only a random fraction δ of the edges of the complete graph. The resulting tree is no longer optimal, but each term of the complexity is reduced:
- construction of an incomplete graph: O(δ n^2 N),
- computation of the maximum weight spanning tree: O(δ n^2 log n).
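A minimal sketch of this randomization, reusing the empirical_mi helper assumed above (the value of delta and the sampling scheme are illustrative choices, not the authors' implementation):

```python
import itertools
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def random_edge_tree(data, delta=0.35, rng=np.random.default_rng(0)):
    """Score only a random fraction delta of the n(n-1)/2 candidate edges."""
    n = data.shape[1]
    pairs = list(itertools.combinations(range(n), 2))
    k = max(n - 1, int(delta * len(pairs)))          # keep at least n-1 candidates
    kept = [pairs[i] for i in rng.choice(len(pairs), size=k, replace=False)]
    w = np.zeros((n, n))
    for i, j in kept:
        w[i, j] = empirical_mi(data[:, i], data[:, j])
    # If the sampled graph is disconnected, the result is a forest, which is allowed.
    mst = minimum_spanning_tree(-w)
    return list(zip(*mst.nonzero()))
```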

Intuitively, the structure of the problem can be exploited to improve on random sampling. In a Euclidean space, similar problems can be approximated by sub-quadratic algorithms: when two points B and C are both close to A, they are likely to be close to each other as well, since d(B,C) ≤ d(A,B) + d(A,C). Mutual information is not a Euclidean distance, but the same reasoning can be applied: if the pairs (A,B) and (A,C) both have high mutual information, I(B;C) may be high as well, since I(B;C) ≥ I(A;B) + I(A;C) − H(A) (derivation in the appendix below).

We want to obtain knowledge about the structure. The algorithm aims at building a set of clusters over the variables and relationships between these clusters, and then exploits them to target interesting edges.

We build the clusters iteratively: a center (X5) is randomly chosen and compared to the 12 other variables.

We build the clusters iteratively: the first cluster is created; it is composed of 5 members and 1 neighbour. Variables are assigned to a cluster based on two thresholds and their empirical mutual information with the center of the cluster.

We build the clusters iteratively: the second cluster is built around X13, the variable furthest away from X5. It is only compared to the 7 remaining variables.

We build the clusters iteratively: after 4 iterations, all variables belong to a cluster and the algorithm stops.

We build the clusters iteratively: computation of the mutual information among variables belonging to the same cluster.

We build the clusters iteratively: computation of the mutual information between variables belonging to neighbouring clusters. (A sketch of the full procedure is given below.)
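The following Python sketch summarizes the clustering procedure illustrated above. The two thresholds, the bookkeeping, and the exact notion of neighbouring clusters are our own assumptions for illustration, not the authors' implementation; it reuses the empirical_mi helper sketched earlier:

```python
import itertools
import numpy as np

def build_clusters(data, t_member, t_neighbour, rng=np.random.default_rng(0)):
    """Iteratively grow clusters around centers until every variable is assigned."""
    n = data.shape[1]
    unassigned = set(range(n))
    clusters = []
    center = int(rng.choice(list(unassigned)))        # first center chosen at random
    while True:
        unassigned.discard(center)
        mi = {j: empirical_mi(data[:, center], data[:, j]) for j in unassigned}
        members = {j for j, v in mi.items() if v >= t_member}
        neighbours = {j for j, v in mi.items() if t_neighbour <= v < t_member}
        clusters.append({"center": center, "members": members, "neighbours": neighbours})
        unassigned -= members                          # members are settled, neighbours stay available
        if not unassigned:
            break
        # Next center: the remaining variable "furthest away" (lowest MI) from the current one.
        center = min(unassigned, key=lambda j: mi[j])
    return clusters

def candidate_edges(clusters):
    """Pairs inside each cluster, plus pairs linking a cluster to its neighbour variables."""
    edges = set()
    for c in clusters:
        inside = {c["center"]} | c["members"]
        edges |= {tuple(sorted(p)) for p in itertools.combinations(inside, 2)}
        edges |= {tuple(sorted((i, j))) for i in inside for j in c["neighbours"] if i != j}
    return edges
```

The mutual informations are then computed only for these candidate edges before running the spanning-tree step.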

Outline: 1. Motivation, 2. Algorithms, 3. Experiments, 4. Conclusion.

Our algorithms were compared against two similar methods:
- complexity reduction: random tree sampling (O(n)), with no connection to the data set,
- variance reduction: bagging (O(n^2 log n)).
Reference: Probability Density Estimation by Perturbing and Combining Tree Structured Markov Networks, S. Ammar et al., ECSQARU 2009.
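For reference, a minimal sketch of the bagging baseline (bootstrap resamples, one Chow-Liu tree per replicate, uniform weights), reusing the chow_liu_tree function sketched earlier; the interface is ours:

```python
import numpy as np

def bagged_trees(data, m=100, rng=np.random.default_rng(0)):
    """Learn m Chow-Liu trees, each on a bootstrap replicate of the data set."""
    trees = []
    for _ in range(m):
        idx = rng.integers(0, data.shape[0], size=data.shape[0])   # sample rows with replacement
        trees.append(chow_liu_tree(data[idx]))
    return trees, np.full(m, 1.0 / m)                              # uniform mixture weights
```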

Experimental settings. Tests were conducted on synthetic binary problems: 1000 variables, averaged over 10 target distributions and 10 data sets; targets were generated randomly. Accuracy evaluation: the exact Kullback-Leibler divergence

D_KL(P_t ‖ P_l) = Σ_x P_t(x) log(P_t(x) / P_l(x))

is too computationally expensive, so it is replaced by a Monte Carlo estimate

D̂_KL(P_t ‖ P_l) = (1/|S|) Σ_{x ∈ S} log(P_t(x) / P_l(x)),

where S is a set of samples drawn from P_t.
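A minimal sketch of such a Monte Carlo estimate, assuming a sampler for the target distribution and log-density callables for both distributions (all names are ours):

```python
import numpy as np

def mc_kl(sample_from_target, log_p_target, log_p_learned, n_samples=10000):
    """Estimate D_KL(P_t || P_l) by averaging log P_t(x) - log P_l(x) over x ~ P_t."""
    xs = [sample_from_target() for _ in range(n_samples)]
    return float(np.mean([log_p_target(x) - log_p_learned(x) for x in xs]))
```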

The proposed algorithm succeeds in improving the random strategy. (Figure: edges similar to the MWST, for single trees of 200 variables.)

Variation of the proportion of edges selected. Results for a mixture of size 100, with 60%, 35%, 20% and 5% of the edges considered: random edge sampling is better than the optimal tree for small data sets but worse for bigger sets, and the more edges considered, the closer to the optimal tree.

The more terms in the mixture, the better the performance (300 samples): more sophisticated methods tend to converge more slowly, random trees are always worse than an optimal tree, and the other mixtures outperform the Chow-Liu tree.

The fewer the samples, the (relatively) better the randomized methods; for high-dimensional problems, data sets will be small. Results for a mixture of size 100: random trees are better when samples are few, bagging is better for N > 50, and clever edge targeting is always better than random edge sampling.

Methods can also be mixed. A combination of bagging and random edge sampling (35% of the edges): the performance lies between the two base methods, the complexity of bagging is improved, and the fewer the samples, the closer it is to bagging.

Conclusion. Our results on randomized mixtures of trees: the accuracy loss is in line with the gain in complexity; the interest of randomization increases when the sample size decreases; clever strategies improve results without hurting complexity, and are worth developing. Future work: experiment with other strategies, and include and test these improvements in other algorithms for building mixtures of trees.

Significance of the curves

Computation time. Training CPU times, cumulated over 100 data sets of 1000 samples (Mac OS X; Intel dual 2 GHz; 4 GB DDR3; GCC 4.0.1):
- Random trees: 2,063 s
- Random edge sampling: 64,569 s
- Clever edge sampling: 59,687 s
- Bagging: 168,703 s

Derivation of the bound I(B;C) ≥ I(A;B) + I(A;C) − H(A) used above:

H(B,C) ≤ H(A,B,C) = H(A) + H(B|A) + H(C|A,B) ≤ H(A) + H(B|A) + H(C|A)

hence

I(B;C) = H(B) + H(C) − H(B,C)
       ≥ H(B) + H(C) − H(A) − H(B|A) − H(C|A)
       = [H(B) + H(A) − H(B,A)] + [H(C) + H(A) − H(C,A)] − H(A)
       = I(A;B) + I(A;C) − H(A).