Online Learning of Probabilistic Graphical Models

1/34 Online Learning of Probabilistic Graphical Models Frédéric Koriche CRIL - CNRS UMR 8188, Univ. Artois koriche@cril.fr CRIL-U Nankin 2016

Probabilistic Graphical Models 2/34 Outline 1 Probabilistic Graphical Models 2 Online Learning 3 Online Learning of Markov Forests 4 Beyond Markov Forests

Probabilistic Graphical Models 3/34 Graphical Models [Figure: application domains: bioinformatics, financial modeling, image processing, social network analysis.] At the intersection of probability theory and graph theory, graphical models are a well-studied representation framework with a wide range of applications.

Probabilistic Graphical Models 4/34 [Figure: an explicit joint probability table listing all 2^12 outcomes, contrasted with a 12-node graphical model (X_1, ..., X_12) annotated with small probability tables.] Graphical Models Encoding high-dimensional distributions in a compact and intuitive way: qualitative uncertainty (interdependencies) is captured by the structure; quantitative uncertainty (probabilities) is captured by the parameters.
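
For instance (a back-of-the-envelope count, not taken from the slides): with n = 12 binary variables, an explicit joint table requires 2^12 − 1 = 4095 free probabilities, whereas a spanning Markov tree only requires n(m−1) + (n−1)(m−1)^2 = 12 + 11 = 23.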

Probabilistic Graphical Models 5/34 Graphical Models [Figure: classes of graphical models: Bayesian networks, sparse Bayesian networks, Bayesian polytrees, Bayesian forests; Markov networks, sparse Markov networks, bounded tree-width MNs, Markov forests.] Classes of Graphical Models For an outcome space X ⊆ R^n, a class of graphical models is a pair M = G × Θ, where G is a space of graphs over n variables and Θ is a space of d-dimensional parameter vectors. G captures structural constraints (directed vs. undirected, sparse vs. dense, etc.); Θ captures parametric constraints (binomial, multinomial, Gaussian, etc.).

Probabilistic Graphical Models 6/34 [Figure: a 15-node Markov forest (X_1, ..., X_15) with a univariate table P(X_1): 0 ↦ 0.75, 1 ↦ 0.25 and a bivariate table P(X_2, X_3): 00 ↦ 0.01, 10 ↦ 0.49, 01 ↦ 0.48, 11 ↦ 0.02.] (Multinomial) Markov Forests Using [m] = {0, ..., m−1}, the class of Markov forests over [m]^n is given by F_n × Θ_{m,n}, where F_n is the space of all acyclic (forest) graphs of order n, and Θ_{m,n} is the space of all parameter vectors mapping a probability table θ_i ∈ [0,1]^m to each candidate node i, and a probability table θ_ij ∈ [0,1]^{m×m} to each candidate edge (i,j).

Probabilistic Graphical Models 7/34 [Figure: a 15-node Bayesian forest with a marginal table P(X_1): 0 ↦ 0.75, 1 ↦ 0.25 and a conditional table P(X_3 | X_2): 00 ↦ 0.01, 10 ↦ 0.99, 01 ↦ 0.98, 11 ↦ 0.02.] (Multinomial) Bayesian Forests The class of Bayesian forests over [m]^n is given by F_n × Θ_{m,n}, where F_n is the space of all directed forests of order n, and Θ_{m,n} is the space of all parameter vectors mapping a probability table θ_i ∈ [0,1]^m to each candidate node i, and a conditional probability table θ_{j|i} : [m] → [0,1]^m to each candidate arc (i,j).

Probabilistic Graphical Models 8/34 Example: The Alarm Model Variables: Earthquake, Burglary, Radio, Alarm, Call. Markov tree representation: joint table P(A, E): 00 ↦ 0.49, 10 ↦ 0.01, 01 ↦ 0.48, 11 ↦ 0.02. Bayesian tree representation: conditional table P(A | E): 00 ↦ 0.65, 10 ↦ 0.35, 01 ↦ 0.01, 11 ↦ 0.99.

Online Learning 9/34 Outline 1 Probabilistic Graphical Models 2 Online Learning 3 Online Learning of Markov Forests 4 Beyond Markov Forests

Online Learning 10/34 [Figure: a data set of binary outcomes over the variables E, B, R, A, C, and a candidate forest over these variables with its probability tables.] Learning Given a class of graphical models M = G × Θ, the learning problem is to extract, from a sequence of outcomes x^{1:T} = (x^1, ..., x^T), the structure G ∈ G and the parameters θ ∈ Θ of a generative model M = (G, θ) capable of predicting future, unseen outcomes. A natural loss function for measuring the performance of M is the log-loss: l(M, x) = −ln P_M(x)

Online Learning 11/34 [Figure: a training set of outcomes from which a graphical model M = (G, θ) is extracted, and a test set of size T on which its average loss is measured.] Batch Learning A two-stage process: a model M ∈ M is first extracted from a training set; the average loss of M, (1/T) Σ_{t=1}^T l(M, x^t), is then evaluated on a test set of size T.

Online Learning 12/34 [Figure: the learner sends a model M^t = (G^t, θ^t) to the environment, the environment returns an outcome x^t, and the learner incurs the loss l(M^t, x^t); the exchange then repeats with M^{t+1} = (G^{t+1}, θ^{t+1}).] Online Learning A sequential process, or repeated game, between the learner and its environment. During each trial t = 1, ..., T: the learner chooses a model M^t ∈ M; the environment responds with an outcome x^t ∈ X; and the learner incurs the loss l(M^t, x^t).
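
To make the protocol concrete, here is a minimal sketch of the repeated game in Python. The Learner/Environment interface (predict, next_outcome, update, probability) is an illustrative assumption, not part of the slides; any concrete learner must supply a model M^t and a way to evaluate P_{M^t}(x).

import math

def online_learning(learner, environment, T):
    """Generic online learning loop: model -> outcome -> loss -> update."""
    cumulative_loss = 0.0
    for t in range(1, T + 1):
        model = learner.predict()              # learner commits to M^t = (G^t, theta^t)
        outcome = environment.next_outcome(t)  # environment reveals x^t
        loss = -math.log(model.probability(outcome))  # log-loss l(M^t, x^t)
        cumulative_loss += loss
        learner.update(outcome)                # learner adapts before trial t+1
    return cumulative_loss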

Online Learning 13/34 Online vs. Batch In batch (PAC) learning, it is assumed that the data (training set + test set) is generated by a fixed (but unknown) probability distribution. In online learning, there is no statistical assumption about the data generated by the environment. Online learning is particularly suited to: * Adaptive environments, where the target distribution can change over time; * Streaming applications, where all the data is not available in advance; * Large-scale datasets, by processing only one outcome at a time.

Online Learning 14/34 Regret Let M be a class of graphical models over an outcome space X, and let A be a learning algorithm for M. The minimax regret of A at horizon T is the maximum, over every sequence of outcomes x^{1:T} = (x^1, ..., x^T), of the cumulative loss of A relative to the best fixed model in M, i.e.

R(A, T) = max_{x^{1:T} ∈ X^T} [ Σ_{t=1}^T l(M^t, x^t) − min_{M ∈ M} Σ_{t=1}^T l(M, x^t) ]

Learnability A class M is (online) learnable if it admits an online learning algorithm A such that: 1 the minimax regret of A is sublinear in T, i.e. lim_{T→∞} R(A, T)/T = 0; 2 the per-round computational complexity of A is polynomial in the dimension of M.

Online Learning of Markov Forests 15/34 Outline 1 Probabilistic Graphical Models 2 Online Learning 3 Online Learning of Markov Forests Regret Decomposition Parametric Regret Structural Regret 4 Beyond Markov Forests

Online Learning of Markov Forests 16/34 [Figure: a 15-node Markov forest whose tables are revised between trials, e.g. P(X_1) updated from (0.75, 0.25) to (0.65, 0.35) and P(X_2, X_3) from (0.01, 0.49, 0.48, 0.02) to (0.05, 0.45, 0.45, 0.05).] Does there exist an efficient online learning algorithm for Markov forests? How can we efficiently update, at each iteration, both the structure F^t and the parameters θ^t in order to minimize the cumulative loss Σ_t l(F^t, θ^t, x^t)?

Online Learning of Markov Forests 17/34 Two key properties For the class F_{m,n} = F_n × Θ_{m,n} of Markov forests: (1) the probability distribution associated with a Markov forest M = (F, θ) factorizes in closed form:

P_M(x) = Π_{i=1}^n θ_i(x_i) · Π_{(i,j) ∈ F} θ_ij(x_i, x_j) / (θ_i(x_i) θ_j(x_j))

(2) the space F_n of forest structures is a matroid; minimizing a linear function over F_n can be done in quadratic time using the matroid greedy algorithm.
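
As an illustration of these two properties, here is a small Python sketch under the assumption of tabular (multinomial) parameters stored as dictionaries; the function and variable names are mine, not the slides'. The first function evaluates ln P_M(x) through the closed-form factorization; the second minimizes a linear edge-weight function over F_n with the matroid (Kruskal-style) greedy algorithm, using a union-find structure to keep the edge set acyclic and retaining only edges that decrease the objective.

import math

def log_prob_markov_forest(x, node_tables, edge_tables, forest_edges):
    """ln P_M(x) = sum_i ln theta_i(x_i) + sum_{(i,j) in F} ln[theta_ij / (theta_i * theta_j)]."""
    logp = sum(math.log(node_tables[i][x[i]]) for i in range(len(x)))
    for (i, j) in forest_edges:
        logp += math.log(edge_tables[(i, j)][(x[i], x[j])]
                         / (node_tables[i][x[i]] * node_tables[j][x[j]]))
    return logp

def min_weight_forest(n, edge_weights):
    """Matroid greedy algorithm: argmin over forests F of sum_{(i,j) in F} w_ij.
    Only negative-weight edges can improve the objective, so edges are scanned
    by increasing weight and added whenever they keep the graph acyclic."""
    parent = list(range(n))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    forest = []
    for (i, j), w in sorted(edge_weights.items(), key=lambda kv: kv[1]):
        if w >= 0:
            break
        ri, rj = find(i), find(j)
        if ri != rj:           # adding (i, j) keeps the structure acyclic
            parent[ri] = rj
            forest.append((i, j))
    return forest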

Online Learning of Markov Forests 18/34 Log-Loss Let M = (f, θ) be a Markov forest, where f is the characteristic vector of the structure. Then

l(M, x) = −ln P_M(x) = −ln [ Π_{i=1}^n θ_i(x_i) · Π_{(i,j)} ( θ_ij(x_i, x_j) / (θ_i(x_i) θ_j(x_j)) )^{f_ij} ]

So the log-loss is an affine function of the forest structure:

l(M, x) = ψ(x) + ⟨f, φ(x)⟩, where ψ(x) = Σ_{i ∈ [n]} ln(1/θ_i(x_i)) and φ_ij(x_i, x_j) = ln( θ_i(x_i) θ_j(x_j) / θ_ij(x_i, x_j) )
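
A minimal sketch of this decomposition, reusing the tabular parameters of the previous sketch (again with illustrative names): it returns ψ(x) together with the edge-indexed vector φ(x), so that l(M, x) = ψ(x) + Σ_{(i,j) ∈ F} φ_ij(x_i, x_j).

import math

def loss_decomposition(x, node_tables, edge_tables):
    """Return (psi, phi) with psi = sum_i ln(1/theta_i(x_i)) and
    phi[(i, j)] = ln(theta_i(x_i) * theta_j(x_j) / theta_ij(x_i, x_j))."""
    psi = sum(math.log(1.0 / node_tables[i][x[i]]) for i in range(len(x)))
    phi = {}
    for (i, j), table in edge_tables.items():
        phi[(i, j)] = math.log(node_tables[i][x[i]] * node_tables[j][x[j]]
                               / table[(x[i], x[j])])
    return psi, phi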

Online Learning of Markov Forests Regret Decomposition 19/34 Regret Decomposition Based on the linearity of the log-loss, the regret decomposes into two parts:

R(M^{1:T}, x^{1:T}) = R(f^{1:T}, x^{1:T}) + R(θ^{1:T}, x^{1:T})

where

R(f^{1:T}, x^{1:T}) = Σ_{t=1}^T [ l(f^t, θ^t, x^t) − l(f*, θ^t, x^t) ]   (Structural Regret)
R(θ^{1:T}, x^{1:T}) = Σ_{t=1}^T [ l(f*, θ^t, x^t) − l(f*, θ*, x^t) ]   (Parametric Regret)

Online Learning of Markov Forests Parametric Regret 20/34 Parametric Regret Based on the closed-form expression of Markov forests, the parametric regret decomposes into local regrets:

R(θ^{1:T}, x^{1:T}) = Σ_{i=1}^n ln[ θ*_i(x_i^{1:T}) / θ_i^{1:T}(x_i^{1:T}) ]   (Univariate estimators)
  + Σ_{(i,j) ∈ F*} ln[ θ*_ij(x_ij^{1:T}) / θ_ij^{1:T}(x_ij^{1:T}) ]   (Bivariate estimators)
  + Σ_{(i,j) ∈ F*} ln[ θ_i^{1:T}(x_i^{1:T}) θ_j^{1:T}(x_j^{1:T}) / ( θ*_i(x_i^{1:T}) θ*_j(x_j^{1:T}) ) ]   (Bivariate compensation)

where θ*_i(x_i^{1:T}) = Π_{t=1}^T θ*_i(x_i^t), θ*_ij(x_ij^{1:T}) = Π_{t=1}^T θ*_ij(x_i^t, x_j^t), θ_i^{1:T}(x_i^{1:T}) = Π_{t=1}^T θ_i^t(x_i^t), and θ_ij^{1:T}(x_ij^{1:T}) = Π_{t=1}^T θ_ij^t(x_i^t, x_j^t).

Online Learning of Markov Forests Parametric Regret 21/34 Local regrets Expressions of the form θ*(x^{1:T}) / θ^{1:T}(x^{1:T}) have been extensively studied in the literature on universal coding (Grünwald, 2007). Dirichlet Mixtures Using symmetric Dirichlet mixtures for the parametric estimators,

θ^{1:T}(x^{1:T}) = ∫ Π_{t=1}^T P_λ(x^t) p_µ(λ) dλ = [ Γ(mµ) / Γ(µ)^m ] · [ Π_{v=1}^m Γ(T_v + µ) / Γ(T + mµ) ]

where T_v denotes the number of occurrences of value v in x^{1:T}. If µ = 1/2, we get the Jeffreys mixture.
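
A short numerical check of this closed form (a sketch under the assumption of a finite alphabet [m], with illustrative names): the marginal likelihood of a sequence under a symmetric Dirichlet(µ) mixture depends only on the value counts T_v.

from collections import Counter
from math import lgamma, exp

def dirichlet_mixture_likelihood(sequence, m, mu=0.5):
    """Gamma(m*mu)/Gamma(mu)^m * prod_v Gamma(T_v + mu) / Gamma(T + m*mu); mu = 0.5 is the Jeffreys mixture."""
    counts = Counter(sequence)
    T = len(sequence)
    log_lik = lgamma(m * mu) - m * lgamma(mu)
    log_lik += sum(lgamma(counts.get(v, 0) + mu) for v in range(m)) - lgamma(T + m * mu)
    return exp(log_lik)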

Online Learning of Markov Forests Parametric Regret 22/34 Jeffreys Strategy (an extension of Xie and Barron (2000) to forest parameters) For each trial t:

1 Set θ_i^{t+1}(u) = (t_u + 1/2) / (t + m/2) for all i ∈ [n], u ∈ [m]
2 Set θ_ij^{t+1}(u, v) = (t_uv + 1/2) / (t + m²/2) for all pairs (i, j) of distinct nodes and all u, v ∈ [m]

where t_u is the number of past trials with x_i = u, and t_uv the number of past trials with (x_i, x_j) = (u, v).

Performance Minimax regret: [ n(m−1) + (n−1)(m−1)² ] / 2 · ln(T/2π) + C_{m,n} + o(m²n) Per-round time complexity: O(m²n²)
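
A minimal sketch of this parameter strategy (illustrative class name; assumes outcomes are vectors over the alphabet [m]): it maintains the univariate and bivariate counts and returns the smoothed estimates θ_i^{t+1} and θ_ij^{t+1} above.

from collections import defaultdict

class JeffreysEstimator:
    """Krichevsky-Trofimov style estimates with +1/2 smoothing, as in the Jeffreys strategy."""
    def __init__(self, n, m):
        self.n, self.m, self.t = n, m, 0
        self.uni = defaultdict(int)   # (i, u)       -> count of trials with x_i = u
        self.bi = defaultdict(int)    # (i, j, u, v) -> count of trials with (x_i, x_j) = (u, v)

    def update(self, x):
        self.t += 1
        for i in range(self.n):
            self.uni[(i, x[i])] += 1
            for j in range(i + 1, self.n):
                self.bi[(i, j, x[i], x[j])] += 1

    def theta_i(self, i, u):
        return (self.uni[(i, u)] + 0.5) / (self.t + self.m / 2)

    def theta_ij(self, i, j, u, v):
        return (self.bi[(i, j, u, v)] + 0.5) / (self.t + self.m ** 2 / 2)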

Online Learning of Markov Forests Structural Regret 23/34 Structural Regret Based on the linear form of the log-loss, the structure learning problem can be cast as a sequential combinatorial optimization problem (Audibert et al., 2011):

Minimize Σ_{t=1}^T ⟨f^t, φ(x^t)⟩ subject to f^1, ..., f^T ∈ F_n

where each linear minimization over F_n is handled with the greedy algorithm.

Online Learning of Markov Forests Structural Regret 24/34 Follow the Perturbed Leader (Kalai and Vempala, 2005) For each trial t:

1 Draw a perturbation vector r^t uniformly at random from [0, 1/α_t] (one component per candidate edge)
2 Set f^{t+1} = argmin_{f ∈ F_n} ⟨f, L^t + r^t⟩, where L^t = Σ_{s=1}^t φ(x^s)

Performance Minimax regret: n² ln(T/2 + m²/4) √(2T) Per-round time complexity: O(n² log n)
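
Putting the pieces together, here is a sketch of the FPL structure update (illustrative names; alpha_t is left as a parameter, and min_weight_forest refers to the greedy routine sketched earlier): the cumulative loss vector L^t is perturbed componentwise and a minimum-weight forest is recomputed greedily.

import random

def accumulate_phi(cumulative_phi, phi_t):
    """L^t = L^{t-1} + phi(x^t), edge by edge."""
    for edge, value in phi_t.items():
        cumulative_phi[edge] = cumulative_phi.get(edge, 0.0) + value
    return cumulative_phi

def fpl_structure_update(cumulative_phi, n, alpha_t):
    """Follow the Perturbed Leader step: f^{t+1} = argmin_{f in F_n} <f, L^t + r^t>."""
    perturbed = {}
    for edge, weight in cumulative_phi.items():
        perturbed[edge] = weight + random.uniform(0.0, 1.0 / alpha_t)  # component of r^t
    return min_weight_forest(n, perturbed)  # matroid greedy step from the earlier sketch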

Online Learning of Markov Forests Structural Regret 25/34 [Plots: average log-loss vs. number of iterations (10 to 100,000, log scale) on KDD Cup 2000 and on MSNBC, comparing CL, CLT, OFDEF and OFDET.] Experiments The average log-loss is measured on the test set at each iteration. The online learner (FPL + Jeffreys) rapidly converges to the batch learner (Chow-Liu for Markov trees, and thresholded Chow-Liu for Markov forests).