Online Forest Density Estimation


Online Forest Density Estimation
Frédéric Koriche, CRIL - CNRS UMR 8188, Univ. Artois, koriche@cril.fr
UAI 2016

Outline
1. Probabilistic Graphical Models
2. Online Density Estimation
3. Online Forest Density Estimation
4. Conclusions

[Figure: a 12-node graphical model over $X_1, \dots, X_{12}$, contrasted with an explicit probability table listing all $2^{12}$ joint assignments.]

Graphical models encode high-dimensional distributions in a compact and intuitive way:
- Qualitative uncertainty (interdependencies) is captured by the structure.
- Quantitative uncertainty (probabilities) is captured by the parameters.

Graphical Models

[Figure: a taxonomy of graphical models: Bayesian networks (sparse Bayesian networks, Bayesian polytrees, Bayesian forests) and Markov networks (sparse Markov networks, bounded tree-width MNs, Markov forests).]

For an outcome space $\mathcal{X} \subseteq \mathbb{R}^n$, a class of graphical models is a pair $\mathcal{M} = \mathcal{G} \times \Theta$, where $\mathcal{G}$ is a space of $n$-dimensional graphs and $\Theta$ is a space of $d$-dimensional parameter vectors.
- $\mathcal{G}$ captures structural constraints (directed vs. undirected, sparse vs. dense, etc.).
- $\Theta$ captures parametric constraints (binomial, multinomial, Gaussian, etc.).

Outline
1. Probabilistic Graphical Models
2. Online Density Estimation
3. Online Forest Density Estimation
4. Conclusions

Online Density Estimation

[Figure: the repeated game: the learner plays $M_t = (G_t, \theta_t)$, the environment responds with an outcome $x_t$, and the learner incurs the loss $\ell(M_t, x_t)$.]

Repeated game between the learner and its environment. During each trial $t = 1, \dots, T$, the learner chooses (the structure and the parameters of) a model $M_t \in \mathcal{M}$; the environment responds with an outcome $x_t \in \mathcal{X}$, and the learner incurs the log-loss
$$\ell(M_t, x_t) = -\ln P_{M_t}(x_t)$$
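To make the protocol concrete, here is a minimal sketch of the game loop in Python. The `Learner` and `Environment` interfaces (`choose_model`, `next_outcome`, `probability`, `update`) are hypothetical names introduced for illustration, not an API from the paper.

```python
import math

def online_density_estimation(learner, environment, T):
    """Run the repeated game for T trials; return the cumulative log-loss."""
    total_loss = 0.0
    for t in range(1, T + 1):
        model = learner.choose_model()                  # M_t = (G_t, theta_t)
        x = environment.next_outcome()                  # outcome x_t in X
        total_loss += -math.log(model.probability(x))   # log-loss l(M_t, x_t)
        learner.update(x)                               # revise structure and parameters
    return total_loss
```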

Online density estimation is particularly suited to:
- adaptive environments, where the target distribution can change over time;
- streaming applications, where all the data is not available in advance;
- large-scale datasets, since only one outcome is processed at a time.

In the literature on online density estimation (universal coding), uni-dimensional models (binomial, multinomial, exponential families) have been extensively studied (Xie and Barron, 2000; Takimoto and Warmuth, 2000; Kotłowski and Grünwald, 2011). Much less is known, however, about multi-dimensional models, especially graphical models, where both the structure and the parameters are updated at each iteration!

The performance of an online learning algorithm $A$ is measured according to two metrics:

Minimax regret. Defined by the maximum, over every sequence of outcomes $x^{1:T} = (x_1, \dots, x_T)$, of the cumulative relative loss between $A$ and the best model in $\mathcal{M}$:
$$R(A, T) = \max_{x^{1:T} \in \mathcal{X}^T} \left[ \sum_{t=1}^{T} \ell(M_t, x_t) - \min_{M \in \mathcal{M}} \sum_{t=1}^{T} \ell(M, x_t) \right]$$

Per-round complexity. Given by the amount of computational resources spent by $A$ at each trial $t$ for choosing a model $M_t$ in $\mathcal{M}$ and evaluating its log-loss $\ell(M_t, x_t) = -\ln P_{M_t}(x_t)$.

Outline
1. Probabilistic Graphical Models
2. Online Density Estimation
3. Online Forest Density Estimation
4. Conclusions

Markov Forests

[Figure: a Markov forest over fifteen variables $X_1, \dots, X_{15}$, annotated with node and edge probability tables.]

For a set of $n$ random variables defined over the discrete domain $\{0, 1, \dots, m-1\}$, the class of ($m$-ary, $n$-dimensional) Markov forests is given by the product $\mathcal{F}_{m,n} = \mathbb{F}_n \times \Theta_{m,n}$, where
- $\mathbb{F}_n$ is the space of all acyclic graphs of order $n$;
- $\Theta_{m,n}$ is the space of all parameter vectors mapping a probability table $\theta_i \in [0,1]^m$ to each candidate node $i$, and a probability table $\theta_{ij} \in [0,1]^{m \times m}$ to each candidate edge $(i,j)$.

Markov Forests: Two Key Properties

For the class of Markov forests:

1. The probability distribution associated with a Markov forest $M = (f, \theta)$ factorizes in closed form:
$$P_M(x) = \prod_{i=1}^{n} \theta_i(x_i) \prod_{(i,j) \in \binom{[n]}{2}} \left( \frac{\theta_{ij}(x_i, x_j)}{\theta_i(x_i)\,\theta_j(x_j)} \right)^{f_{ij}}$$
So the log-loss, extended to $\operatorname{conv}(\mathbb{F}_n) \times \Theta_{m,n}$, is an affine function of the structure:
$$\ell(p, \theta, x) = \psi(x) + \langle p, \phi(x) \rangle, \quad \text{where } \psi(x) = \sum_{i \in [n]} \ln \frac{1}{\theta_i(x_i)} \text{ and } \phi_{ij}(x_i, x_j) = \ln \frac{\theta_i(x_i)\,\theta_j(x_j)}{\theta_{ij}(x_i, x_j)}$$

2. The space $\mathbb{F}_n$ of forest structures is a matroid; minimizing a linear function over $\mathbb{F}_n$ can be done in quadratic time using the matroid greedy algorithm (a sketch of this greedy step appears after the structural regret analysis below).
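As a concrete illustration of the affine form $\ell(p, \theta, x) = \psi(x) + \langle p, \phi(x) \rangle$, here is a minimal Python sketch evaluating the log-loss of a fixed forest. It assumes node tables `theta_i[i][u]` and edge tables `theta_ij[(i, j)][(u, v)]` stored as dictionaries; the data layout is illustrative, not the paper's.

```python
import math

def forest_log_loss(edges, theta_i, theta_ij, x):
    """Log-loss -ln P_M(x) of a Markov forest with edge set `edges`."""
    n = len(x)
    # psi(x): sum over nodes of ln(1 / theta_i(x_i)).
    loss = sum(-math.log(theta_i[i][x[i]]) for i in range(n))
    # <f, phi(x)>: one term phi_ij(x_i, x_j) per edge present in the forest.
    for (i, j) in edges:
        loss += math.log(theta_i[i][x[i]] * theta_i[j][x[j]]
                         / theta_ij[(i, j)][(x[i], x[j])])
    return loss
```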

The Algorithm

Initialization:
- set $\theta^1 = U_{m,n}$ and $p^1 = 0$

For each trial $t = 1, \dots, T$:
- play $M_t = (f_t, \theta_t)$, where $f_t = \mathrm{SWAP}_1(p_t)$; receive $x_t$
- Parameter update (Jeffreys rule):
  - set $\theta_i^{t+1}(u) = \frac{t_u + 1/2}{t + m/2}$ for all $n$ nodes
  - set $\theta_{ij}^{t+1}(u,v) = \frac{t_{uv} + 1/2}{t + m^2/2}$ for all $\binom{n}{2}$ possible edges
- Structure update (mixture of perturbed leaders):
  - draw $r_t$ uniformly at random in $\left[0, \frac{1}{\beta_t}\right]^{\binom{n}{2}}$
  - set $f_{t+1/2} = \operatorname{argmin}_{f \in \mathbb{F}_n} \left\langle f, r_t + \sum_{s=1}^{t} \phi_s(x_s) \right\rangle$
  - set $p_{t+1/2} = \alpha_t p_t + (1 - \alpha_t) f_{t+1/2}$
  - set $p_{t+1} = \mathrm{SWAP}_k(p_{t+1/2})$

$\mathrm{SWAP}_k$ uses random base exchanges to derive a convex mixture with at most $k$ components.
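A minimal sketch of the parameter-update step, assuming the counts $t_u$ and $t_{uv}$ are maintained in dictionaries `node_counts` and `edge_counts` (illustrative names, with every symbol present as a key, count 0 if unseen); each table is simply an add-1/2 smoothed frequency estimate.

```python
def jeffreys_update(node_counts, edge_counts, t, m):
    """Jeffreys-rule (add-1/2) estimates theta^{t+1} from counts after t outcomes."""
    theta_i = {i: {u: (c + 0.5) / (t + m / 2.0) for u, c in table.items()}
               for i, table in node_counts.items()}
    theta_ij = {e: {uv: (c + 0.5) / (t + m * m / 2.0) for uv, c in table.items()}
                for e, table in edge_counts.items()}
    return theta_i, theta_ij
```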

Analysis

Based on the closed-form expression of Markov forests, parameter updates and structure updates can be analyzed independently:
$$R(M^{1:T}, x^{1:T}) = R(p^{1:T}, x^{1:T}) + R(\theta^{1:T}, x^{1:T})$$
where
$$R(p^{1:T}, x^{1:T}) = \sum_{t=1}^{T} \ell(p_t, \theta_t, x_t) - \ell(p, \theta_t, x_t) \quad \text{(structural regret)}$$
$$R(\theta^{1:T}, x^{1:T}) = \sum_{t=1}^{T} \ell(p, \theta_t, x_t) - \ell(p, \theta, x_t) \quad \text{(parametric regret)}$$

Analysis: Parametric Regret

Decomposable into local regrets, which can be bounded using universal coding techniques:
$$R(\theta^{1:T}, x^{1:T}) = \underbrace{\sum_{i=1}^{n} \ln \frac{\theta_i(x_i^{1:T})}{\theta_i^{1:T}(x_i^{1:T})}}_{\text{univariate estimators}} + \underbrace{\sum_{(i,j) \in F} \ln \frac{\theta_{ij}(x_{ij}^{1:T})}{\theta_{ij}^{1:T}(x_{ij}^{1:T})}}_{\text{bivariate estimators}} + \underbrace{\sum_{(i,j) \in F} \ln \frac{\theta_i^{1:T}(x_i^{1:T})\,\theta_j^{1:T}(x_j^{1:T})}{\theta_i(x_i^{1:T})\,\theta_j(x_j^{1:T})}}_{\text{bivariate compensation}}$$

Using symmetric Dirichlet mixtures for the parametric estimators,
$$\theta^{1:T}(x^{1:T}) = \int \prod_{t=1}^{T} P_\lambda(x_t)\, p_\mu(\lambda)\, d\lambda = \frac{\Gamma(m\mu)}{\Gamma(\mu)^m} \cdot \frac{\prod_{v=1}^{m} \Gamma(t_v + \mu)}{\Gamma(T + m\mu)}$$
the parametric regret for $\mu = \frac{1}{2}$ (Jeffreys mixture) is in $O(\ln T)$. The per-round time complexity for parameter updates is in $O(m^2 n^2)$.
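The Jeffreys-mixture probability above depends only on the symbol counts, so it can be computed directly from the closed form. Here is a small sketch in log-space, using `lgamma` for numerical stability; the function name is illustrative.

```python
from math import lgamma

def log_jeffreys_mixture(counts, mu=0.5):
    """ln theta^{1:T}(x^{1:T}) for symbol counts t_1,...,t_m over an m-ary alphabet."""
    m, T = len(counts), sum(counts)
    return (lgamma(m * mu) - m * lgamma(mu)         # normalizer Gamma(m*mu) / Gamma(mu)^m
            + sum(lgamma(c + mu) for c in counts)   # prod_v Gamma(t_v + mu)
            - lgamma(T + m * mu))                   # divided by Gamma(T + m*mu)
```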

Analysis: Structural Regret

Based on the telescopic decomposition (writing $\ell_t = \phi_t(x_t)$),
$$R(p^{1:T}, x^{1:T}) = \underbrace{\sum_{t=1}^{T} \langle p_t - p_{t+1/2}, \ell_t \rangle}_{\leq\, 0} + \underbrace{\sum_{t=1}^{T} \langle p_{t+1/2} - f_{t+1/2}, \ell_t \rangle}_{\text{convex mixture}} + \underbrace{\sum_{t=1}^{T} \langle f_{t+1/2} - p, \ell_t \rangle}_{\text{Follow the Perturbed Leader}}$$
By the Follow the Perturbed Leader analysis (Kalai and Vempala, 2005), the structural regret is in $O(\sqrt{T \ln T})$. The per-round time complexity for structure updates is in $O(n^2 \log n + kn^2)$.
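The leader computation $\operatorname{argmin}_{f \in \mathbb{F}_n} \langle f, w \rangle$ exploits the matroid property noted earlier: since any subgraph of a forest is a forest, it suffices to greedily add every negative-weight edge that keeps the graph acyclic. A minimal sketch with union-find, under the assumption that `weights` maps each candidate edge to its perturbed cumulative loss (names are illustrative):

```python
def min_weight_forest(n, weights):
    """Minimize <f, w> over forests on n nodes; `weights` maps edges (i, j) to w_ij."""
    parent = list(range(n))

    def find(i):                      # union-find root with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    forest = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: kv[1]):
        if w >= 0:                    # non-negative edges can only increase <f, w>
            break
        ri, rj = find(i), find(j)
        if ri != rj:                  # independence test: adding (i, j) keeps acyclicity
            parent[ri] = rj
            forest.append((i, j))
    return forest
```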

Preliminary Experiments

[Figure: two plots of average log-loss against iterations (log scale, 10 to 100,000) on KDD Cup 2000 (left) and MSNBC (right), comparing CL, CLT, OFDE-F, and OFDE-T.]

The OFDE algorithm (with F for forests and T for trees, $k = \ln n$) was compared to batch algorithms (Chow-Liu (1968) for trees, and Chow-Liu with thresholding (Tan et al., 2011) for forests), which had the benefit of hindsight on the training set. The average log-loss was measured on the test set at the end of each iteration.

Outline
1. Probabilistic Graphical Models
2. Online Density Estimation
3. Online Forest Density Estimation
4. Conclusions

Conclusions

Online density estimation has very attractive properties:
- designed for adversarial environments;
- naturally suited to streaming applications;
- applicable to large-scale settings with massive amounts of data.

Online graphical density estimation is challenging: we are faced with a tradeoff between minimax optimality and computational complexity, since minimax optimality often requires super-exponential time. Online approximation algorithms (Kakade et al., 2009) look promising for handling more expressive graphical models (e.g., polytrees, bounded-treewidth networks).

Thank You! (And thanks to the anonymous reviewers for helping to improve this paper!)

References

C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462-467, 1968.

S. Kakade, A. Kalai, and K. Ligett. Playing games with approximation algorithms. SIAM Journal on Computing, 39(3):1088-1106, 2009.

A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291-307, 2005.

W. Kotłowski and P. Grünwald. Maximum likelihood vs. sequential normalized maximum likelihood in on-line density estimation. In Proc. of COLT, pages 457-476, 2011.

E. Takimoto and M. Warmuth. The last-step minimax algorithm. In Proc. of ALT, pages 279-290, 2000.

V. Tan, A. Anandkumar, and A. Willsky. Learning high-dimensional Markov forest distributions: Analysis of error rates. Journal of Machine Learning Research, 12:1617-1653, 2011.

Q. Xie and A. Barron. Asymptotic minimax regret for data compression, gambling, and prediction. IEEE Transactions on Information Theory, 46(2):431-445, 2000.