1/34 Online Learning of Probabilistic Graphical Models Frédéric Koriche CRIL - CNRS UMR 8188, Univ. Artois koriche@cril.fr CRIL-U Nankin 2016
Probabilistic Graphical Models 2/34 Outline 1 Probabilistic Graphical Models 2 Online Learning 3 Online Learning of Markov Forests 4 Beyond Markov Forests
Probabilistic Graphical Models 3/34 Graphical Models Bioinformatics Financial Modeling Image Processing Social Networks Analysis Graphical Models At the intersection of Probability Theory and Graph Theory, graphical models are a well-studied representation framework with various applications.
Probabilistic Graphical Models 4/34 000000000000 000000000001 000000000010 000000000011 000000000100 000000000101 000000000110 000000000111... 111111111000 111111111001 111111111010 111111111011 111111111100 111111111101 111111111110 111111111111 0.01 0.49 0.48 0.02 X 12 X 9 X 8 X 3 X 11 X 7 X 2 X 10 X 6 X 1 X 4 X 5 0.75 0.25 Graphical Models Encoding high-dimensional distributions in a compact and intuitive way: Qualitative uncertainty (interdependencies) is captured by the structure Quantitative uncertainty (probabilities) is captured by the parameters
Probabilistic Graphical Models 5/34 Graphical Models Bayesian Networks Sparse Bayesian Networks Bayesian Polytrees Bayesian Forests Markov Networks Sparse Markov Networks Bounded Tree-width MNs Markov Forests Classes of Graphical Models For an outcome space X R n, a class of graphical models is a pair M = G Θ, where G is space of n-dimensional graphs, and Θ is a space of d-dimensional vectors. G captures structural constraints (directed vs. undirected, sparse vs. dense, etc.) Θ captures parametric constraints (binomial, multinomial, Gaussian, etc.)
Probabilistic Graphical Models 6/34 P(X 2,X 3 ) 00 0.01 10 0.49 01 0.48 11 0.02 X 9 X 8 X 7 X 2 X 6 X 15 X 5 X 1 P(X 1 ) 0 0.75 1 0.25 X 3 X 13 X 12 X 11 X 10 X 14 X 4 (Multinomial) Markov Forests Using [m] = {0,,m 1}, the class of Markov Forests over [m] n is given by F n Θ m,n, where F n is the space of all acyclic graphs of order n; Θ m,n is the space of all parameter vectors mapping a probability table θ i [0,1] m to each candidate node i, and a probability table θ ij [0,1] m m to each candidate edge (i,j).
Probabilistic Graphical Models 7/34 P(X 3 X 2 ) 00 0.01 10 0.99 01 0.98 11 0.02 X 9 X 8 X 7 X 2 X 6 X 15 X 5 X 1 P(X 1 ) 0 0.75 1 0.25 X 3 X 13 X 12 X 11 X 10 X 14 X 4 (Multinomial) Bayesian Forests The class of Bayesian Forests over [m] n is given by F n Θ m,n, where F n is the space of all directed forests of order n; Θ m,n is the space of all parameter vectors mapping a probability table θ i [0,1] m to each candidate node i, and a conditional probability table θ j i [m] [0,1] m to each candidate arc (i,j).
Probabilistic Graphical Models 8/34 Earthquake P(A,E) 00 0.49 10 0.01 01 0.48 11 0.02 Burglary Radio Alarm Call Example: The Alarm Model Markov tree representation Bayesian tree representation
Probabilistic Graphical Models 8/34 Earthquake P(A E) 00 0.65 10 0.35 01 0.01 11 0.99 Burglary Radio Alarm Call Example: The Alarm Model Markov tree representation Bayesian tree representation
Online Learning 9/34 Outline 1 Probabilistic Graphical Models 2 Online Learning 3 Online Learning of Markov Forests 4 Beyond Markov Forests
Online Learning 10/34 E B R A C 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0. 0 1 0 1 1 0 1 0 1 1 1 0 1 1 0 B A C E R 0 0.65 1 0.35 Learning Given a class of graphical models M = G Θ, the learning problem is to extract from a sequence of outcomes x 1:T = (x 1,,x T ), the structure G G, and the parameters θ Θ of a generative model M = (G, θ) capable of predicting future, unseen, outcomes. A natural loss function for measuring the performance of M is the log-loss: l(m,x) = lnp M (x)
Online Learning 10/34 E B R A C 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0. 0 1 0 1 1 0 1 0 1 1 1 0 1 1 0 B A C E R 0 0.65 1 0.35 Learning Given a class of graphical models M = G Θ, the learning problem is to extract from a sequence of outcomes x 1:T = (x 1,,x T ), the structure G G, and the parameters θ Θ of a generative model M = (G, θ) capable of predicting future, unseen, outcomes. A natural loss function for measuring the performance of M is the log-loss: l(m,x) = lnp M (x)
Online Learning 10/34 E B R A C 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0. 0 1 0 1 1 0 1 0 1 1 1 0 1 1 0 B A C E R 0 0.65 1 0.35 Learning Given a class of graphical models M = G Θ, the learning problem is to extract from a sequence of outcomes x 1:T = (x 1,,x T ), the structure G G, and the parameters θ Θ of a generative model M = (G, θ) capable of predicting future, unseen, outcomes. A natural loss function for measuring the performance of M is the log-loss: l(m,x) = lnp M (x)
Online Learning 11/34 000000010000 000000010001 000000000010 000000000011 010000000101 010000000101 100000000111 100000000111. 110111111000 110111111001 111110111010 110111111011 111011111101 111011111101 111011111101 111111110111 M = (G,θ) Graphical Model Training Set Batch Learning A two-stage process: A model M M is first extracted from a training set. The average loss of the model M is evalued using a test set.
Online Learning 11/34 M = (G,θ) Graphical Model 1 T Tt=1 l(m,x t ) 000000010000 000000010001 000000000010 000000000011 010000000101 010000000101 100000000111. 110111111000 110111111001 111110111010 111011111101 111011111101 111011111101 111111110111 Test Set of size T Batch Learning A two-stage process: A model M M is first extracted from a training set. The average loss of the model M is evalued using a test set.
Online Learning 12/34 Environment Learner Online Learning A sequential process, or repeated game between the learner and its environment. During each trial t = 1,,T, the learner chooses a model M t M ; the environment responds by an outcome x t X, and the learner incurs the loss l(m t,x t ).
Online Learning 12/34 Environment M t = ( G t,θ t) Learner Online Learning A sequential process, or repeated game between the learner and its environment. During each trial t = 1,,T, the learner chooses a model M t M ; the environment responds by an outcome x t X, and the learner incurs the loss l(m t,x t ).
Online Learning 12/34 l ( M t,θ t) Environment 100000000111 M t = ( G t,θ t) Learner Online Learning A sequential process, or repeated game between the learner and its environment. During each trial t = 1,,T, the learner chooses a model M t M ; the environment responds by an outcome x t X, and the learner incurs the loss l(m t,x t ).
Online Learning 12/34 Environment M t+1 = ( G t+1,θ t+1) Learner Online Learning A sequential process, or repeated game between the learner and its environment. During each trial t = 1,,T, the learner chooses a model M t M ; the environment responds by an outcome x t X, and the learner incurs the loss l(m t,x t ).
Online Learning 12/34 l ( M t+1,θ t+1) Environment 100010010000 M t+1 = ( G t+1,θ t+1) Learner Online Learning A sequential process, or repeated game between the learner and its environment. During each trial t = 1,,T, the learner chooses a model M t M ; the environment responds by an outcome x t X, and the learner incurs the loss l(m t,x t ).
Online Learning 13/34 Online vs. Batch In batch (PAC) learning, it is assumed that the data (training set + test set) is generated by a fixed (but unknown) probability distribution. In online learning, there is no statistical assumption about the data generated by the environment. Online learning is particularly suited to: * Adaptive environments, where the target distribution can change over time; * Streaming applications, where all the data is not available in advance; * Large-scale datasets, by processing only one outcome at a time.
Online Learning 13/34 Online vs. Batch In batch (PAC) learning, it is assumed that the data (training set + test set) is generated by a fixed (but unknown) probability distribution. In online learning, there is no statistical assumption about the data generated by the environment. Online learning is particularly suited to: * Adaptive environments, where the target distribution can change over time; * Streaming applications, where all the data is not available in advance; * Large-scale datasets, by processing only one outcome at a time.
Online Learning 14/34 Regret Let M be a class of graphical models over an outcome space X, and A be a learning algorithm for M. The minimax regret of A at horizon T, is the maximum, over every sequence of outcomes x 1:T = (x 1,,x T ), of the cumulative relative loss between A and the best model in M, i.e. [ ] T R(A,T) = max l(m t,x t T ) min l(m,x t ) x 1:T X T t=1 M M t=1 Learnability A class M is (online) learnable if it admits an online learning algorithm A such that: 1 the minimax regret of A is sublinear in T, i.e. lim R(A,T) = 0 T 2 the per-round computational complexity of A is polynomial in the dimension of M.
Online Learning 14/34 Regret Let M be a class of graphical models over an outcome space X, and A be a learning algorithm for M. The minimax regret of A at horizon T, is the maximum, over every sequence of outcomes x 1:T = (x 1,,x T ), of the cumulative relative loss between A and the best model in M, i.e. [ ] T R(A,T) = max l(m t,x t T ) min l(m,x t ) x 1:T X T t=1 M M t=1 Learnability A class M is (online) learnable if it admits an online learning algorithm A such that: 1 the minimax regret of A is sublinear in T, i.e. lim R(A,T) = 0 T 2 the per-round computational complexity of A is polynomial in the dimension of M.
Online Learning of Markov Forests 15/34 Outline 1 Probabilistic Graphical Models 2 Online Learning 3 Online Learning of Markov Forests Regret Decomposition Parametric Regret Structural Regret 4 Beyond Markov Forests
Online Learning of Markov Forests 16/34 X 3 X 2 0.01 0.49 0.48 0.02 X 9 X 8 X 7 X 2 X 6 X 15 X 5 X 1 0.75 0.25 X 3 X 13 X 12 X 11 X 10 X 14 X 4 Does there exist an efficient online learning algorithm for Markov forests? How can we efficiently update at each iteration both the structure F t and the parameters θ t in order to minimize the cumulative loss t l(f t,θ t,x t )?
Online Learning of Markov Forests 16/34 X 3 X 2 0.05 0.45 0.45 0.05 X 9 X 8 X 7 X 2 X 6 X 15 X 5 X 1 0.65 0.35 X 3 X 13 X 12 X 11 X 10 X 14 X 4 Does there exist an efficient online learning algorithm for Markov forests? How can we efficiently update at each iteration both the structure F t and the parameters θ t in order to minimize the cumulative loss t l(f t,θ t,x t )?
Online Learning of Markov Forests 17/34 Two key properties For the class F m,n = F n Θ m,n of Markov forests, The probability distribution associated with a Markov forest M = (F, θ) can be factorized into a closed-form: n θ ij (x i,x j ) P M (x) = θ i (x i ) i=1 (i,j) F θ i (x i )θ j (x j ) The space F n of forest structures is a matroid; minimizing a linear function over F n can be done in quadratic time using the matroid greedy algorithm.
Online Learning of Markov Forests 17/34 Two key properties For the class F m,n = F n Θ m,n of Markov forests, The probability distribution associated with a Markov forest M = (F, θ) can be factorized into a closed-form: n θ ij (x i,x j ) P M (x) = θ i (x i ) i=1 (i,j) F θ i (x i )θ j (x j ) The space F n of forest structures is a matroid; minimizing a linear function over F n can be done in quadratic time using the matroid greedy algorithm.
Online Learning of Markov Forests 18/34 Log-Loss Let M = (f,θ) be a Markov forest, where f is the characteristic vector of the structure. l(m,x) = lnp M (x) = ln θ i (x i ) i=1 (i,j) ( θij (x i,x j ) θ i (x i )θ j (x j ) ) fij So, the log-loss is an affine function of the forest structure: l(m,x) = ψ(x) + f,φ(x) where ψ(x) = i [n] ln 1 θ i (x i ) and φ ij (x i,x j ) = ln ( ) θi (x i )θ j (x j ) θ ij (x i,x j )
Online Learning of Markov Forests 18/34 Log-Loss Let M = (f,θ) be a Markov forest, where f is the characteristic vector of the structure. l(m,x) = lnp M (x) = ln θ i (x i ) i=1 (i,j) ( θij (x i,x j ) θ i (x i )θ j (x j ) ) fij So, the log-loss is an affine function of the forest structure: l(m,x) = ψ(x) + f,φ(x) where ψ(x) = i [n] ln 1 θ i (x i ) and φ ij (x i,x j ) = ln ( ) θi (x i )θ j (x j ) θ ij (x i,x j )
Online Learning of Markov Forests Regret Decomposition 19/34 Regret Decomposition Based on the linearity of the log-loss, the regret is decomposable into two parts: ( ) ( ) R M 1:T,x 1:T = R f 1:T,x 1:T + R (θ ) 1:T,x 1:T where ( ) R f 1:T,x 1:T T = l(f t,θ t,x t ) l(f,θ t,x t ) t=1 ( ) R θ 1:T,x 1:T T = l(f,θ t,x t ) l(f,θ,x t ) t=1 (Structural Regret) (Parametric Regret)
Online Learning of Markov Forests Regret Decomposition 19/34 Regret Decomposition Based on the linearity of the log-loss, the regret is decomposable into two parts: ( ) ( ) R M 1:T,x 1:T = R f 1:T,x 1:T + R (θ ) 1:T,x 1:T where ( ) R f 1:T,x 1:T T = l(f t,θ t,x t ) l(f,θ t,x t ) t=1 ( ) R θ 1:T,x 1:T T = l(f,θ t,x t ) l(f,θ,x t ) t=1 (Structural Regret) (Parametric Regret)
Online Learning of Markov Forests Parametric Regret 20/34 Parametric Regret Based on the closed-form expression of Markov forests, the parametric regret is decomposable into local regrets: ( ) R θ 1:T,x 1:T = n ln θ i (x1:t ) i θ i 1:T (x i 1:T ) i=1 + ln (i,j) F + (i,j) F θ ij (x1:t ) ij θ ij 1:T (x ij 1:T ) ln θ1:t i (x i 1:T ) ) θ i (x1:t i θ 1:T j (x j 1:T ) θ j (x1:t ) j (Univariate estimators) (Bivariate estimators) (Bivariate compensation) where θi T (x1:t i ) = θi (xt i ), t=1 θij T (x1:t ij ) = θij (xt i,xt j ), t=1 θ1:t i (xi 1:T T ) = θi t (xt i ) t=1 θ1:t ij (xij 1:T T ) = θij t (xt i,xt j ) t=1
Online Learning of Markov Forests Parametric Regret 20/34 Parametric Regret Based on the closed-form expression of Markov forests, the parametric regret is decomposable into local regrets: ( ) R θ 1:T,x 1:T = n ln θ i (x1:t ) i θ i 1:T (x i 1:T ) i=1 + ln (i,j) F + (i,j) F θ ij (x1:t ) ij θ ij 1:T (x ij 1:T ) ln θ1:t i (x i 1:T ) ) θ i (x1:t i θ 1:T j (x j 1:T ) θ j (x1:t ) j (Univariate estimators) (Bivariate estimators) (Bivariate compensation) where θi T (x1:t i ) = θi (xt i ), t=1 θij T (x1:t ij ) = θij (xt i,xt j ), t=1 θ1:t i (xi 1:T T ) = θi t (xt i ) t=1 θ1:t ij (xij 1:T T ) = θij t (xt i,xt j ) t=1
Online Learning of Markov Forests Parametric Regret 21/34 Local regrets Expressions of the form θ (x 1:T ) θ 1:T (x 1:T ) Use Dirichlet Mixtures have been extensively studied in literature of universal coding (Grünwald, 2007). Dirichlet Mixtures Using symmetric Dirichlet mixtures for the parametric estimators, θ 1:T (x 1:T ) = T t=1 If µ = 2 1, we get the Jeffreys mixture. P λ (x t )p µ (λ)dλ = Γ(mµ) mv=1 Γ(t v + µ) Γ(µ) m Γ(t + mµ)
Online Learning of Markov Forests Parametric Regret 21/34 Local regrets Expressions of the form θ (x 1:T ) θ 1:T (x 1:T ) Use Dirichlet Mixtures have been extensively studied in literature of universal coding (Grünwald, 2007). Dirichlet Mixtures Using symmetric Dirichlet mixtures for the parametric estimators, θ 1:T (x 1:T ) = T t=1 If µ = 2 1, we get the Jeffreys mixture. P λ (x t )p µ (λ)dλ = Γ(mµ) mv=1 Γ(t v + µ) Γ(µ) m Γ(t + mµ)
Online Learning of Markov Forests Parametric Regret 21/34 Local regrets Expressions of the form θ (x 1:T ) θ 1:T (x 1:T ) Use Dirichlet Mixtures have been extensively studied in literature of universal coding (Grünwald, 2007). Dirichlet Mixtures Using symmetric Dirichlet mixtures for the parametric estimators, θ 1:T (x 1:T ) = T t=1 If µ = 2 1, we get the Jeffreys mixture. P λ (x t )p µ (λ)dλ = Γ(mµ) mv=1 Γ(t v + µ) Γ(µ) m Γ(t + mµ)
Online Learning of Markov Forests Parametric Regret 22/34 Jeffreys Strategy (An extension of Xie and Barron (2000) to forest parameters) For each trial t, 1 Set θ i t+1 (u) = t u + 2 1 t + m 2 2 Set θ ij t+1 (u,v) = t uv + 2 1 t + m2 2 for all i [n],u [m] for all (i,j) ( [n]) 2,u,v [m] Performance n(m 1) + (n 1)(m 1)2 Minimax regret: ln T 2 2π + C m,n + o(m 2 n) Per-round time complexity: O(m 2 n 2 )
Online Learning of Markov Forests Parametric Regret 22/34 Jeffreys Strategy (An extension of Xie and Barron (2000) to forest parameters) For each trial t, 1 Set θ i t+1 (u) = t u + 2 1 t + m 2 2 Set θ ij t+1 (u,v) = t uv + 2 1 t + m2 2 for all i [n],u [m] for all (i,j) ( [n]) 2,u,v [m] Performance n(m 1) + (n 1)(m 1)2 Minimax regret: ln T 2 2π + C m,n + o(m 2 n) Per-round time complexity: O(m 2 n 2 )
Online Learning of Markov Forests Structural Regret 23/34 Structural Regret Based on the linear form of the log-loss, the structural learning problem can be cast as a sequential combinatorial optimization problem (Audibert et al., 2011). Minimize [ ] T t f s,φ(x s ) t=1 s=1 subject to f 1,,f T F n
Online Learning of Markov Forests Structural Regret 23/34 Structural Regret Based on the linear form of the log-loss, the structural learning problem can be cast as a sequential combinatorial optimization problem (Audibert et al., 2011). Minimize subject to [ ] T t f s,φ(x s ) t=1 s=1 Use the greedy algorithm f 1,,f T F n
Online Learning of Markov Forests Structural Regret 24/34 Follow the Perturbed Leader (Kalai and Vempala, 2005) For each trial t, 1 Draw r t in [0, 1 α t ] uniformly at random 2 Set f t+1 = argmin f Fn f,l t + r t where L t = t s=1 φ(x s ) Performance Minimax regret: n 2 ln(t/2 + m 2 /4) 2T Per-round time complexity: O(n 2 logn)
Online Learning of Markov Forests Structural Regret 24/34 Follow the Perturbed Leader (Kalai and Vempala, 2005) For each trial t, 1 Draw r t in [0, 1 α t ] uniformly at random 2 Set f t+1 = argmin f Fn f,l t + r t where L t = t s=1 φ(x s ) Performance Minimax regret: n 2 ln(t/2 + m 2 /4) 2T Per-round time complexity: O(n 2 logn)
Online Learning of Markov Forests Structural Regret 25/34 Average log-loss 7 6 5 4 3 2 CL CLT OFDEF OFDET 10 100 1000 10000 100000 KDD Cup 2000: Iterations Experiments The average log-loss is measured on the test set at each iteration. The online learner (FPL + Jeffreys) rapidly converges to the batch learner (Chow-Liu for Markov trees, and Thresholded Chow-Liu for Markov forests).
Online Learning of Markov Forests Structural Regret 25/34 Average log-loss 10 9.5 9 8.5 8 7.5 7 6.5 CL CLT OFDEF OFDET 6 10 100 1000 10000 100000 MSNBC: Iterations Experiments The average log-loss is measured on the test set at each iteration. The online learner (FPL + Jeffreys) rapidly converges to the batch learner (Chow-Liu for Markov trees, and Thresholded Chow-Liu for Markov forests).