Online Learning of Probabilistic Graphical Models
Frédéric Koriche, CRIL (CNRS UMR 8188, Univ. Artois), koriche@cril.fr. CRIL-U Nankin, 2016.
Outline
1. Probabilistic Graphical Models
2. Online Learning
3. Online Learning of Markov Forests
4. Beyond Markov Forests
Part 1: Probabilistic Graphical Models

Graphical Models
At the intersection of probability theory and graph theory, graphical models are a well-studied representation framework with applications in bioinformatics, financial modeling, image processing, and social network analysis.
[Figure: an undirected graph over variables X_1, ..., X_12.]
Graphical models encode high-dimensional distributions in a compact and intuitive way:
Qualitative uncertainty (interdependencies) is captured by the structure;
Quantitative uncertainty (probabilities) is captured by the parameters.
Classes of Graphical Models
Examples include Bayesian networks (sparse Bayesian networks, Bayesian polytrees, Bayesian forests) and Markov networks (sparse Markov networks, bounded tree-width MNs, Markov forests).
For an outcome space X ⊆ R^n, a class of graphical models is a pair M = G × Θ, where G is a space of n-dimensional graphs and Θ is a space of d-dimensional parameter vectors.
G captures structural constraints (directed vs. undirected, sparse vs. dense, etc.);
Θ captures parametric constraints (binomial, multinomial, Gaussian, etc.).
(Multinomial) Markov Forests
[Figure: a Markov forest over X_1, ..., X_15, annotated with a node table P(X_1) and an edge table P(X_2, X_3).]
Using [m] = {0, ..., m−1}, the class of Markov forests over [m]^n is given by F_n × Θ_{m,n}, where
F_n is the space of all acyclic graphs of order n;
Θ_{m,n} is the space of all parameter vectors mapping a probability table θ_i ∈ [0,1]^m to each candidate node i, and a probability table θ_ij ∈ [0,1]^{m×m} to each candidate edge (i,j).
(Multinomial) Bayesian Forests
[Figure: a Bayesian forest over X_1, ..., X_15, annotated with a node table P(X_1) and a conditional table P(X_3 | X_2).]
The class of Bayesian forests over [m]^n is given by F_n × Θ_{m,n}, where
F_n is the space of all directed forests of order n;
Θ_{m,n} is the space of all parameter vectors mapping a probability table θ_i ∈ [0,1]^m to each candidate node i, and a conditional probability table θ_{j|i} : [m] → [0,1]^m to each candidate arc (i,j).
Example: The Alarm Model
Variables: Earthquake, Burglary, Radio, Alarm, Call.
Markov tree representation: edges carry joint tables such as P(A, E).
Bayesian tree representation: arcs carry conditional tables such as P(A | E).
Part 2: Online Learning
Learning
Given a class of graphical models M = G × Θ, the learning problem is to extract, from a sequence of outcomes x_{1:T} = (x_1, ..., x_T), the structure G ∈ G and the parameters θ ∈ Θ of a generative model M = (G, θ) capable of predicting future, unseen outcomes.
A natural loss function for measuring the performance of M is the log-loss:
l(M, x) = −ln P_M(x)
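The factorized forest distribution and the log-loss above can be sketched in a few lines. The three-variable tree and all probability tables below are made-up illustrative numbers (with consistent marginals), not parameters from the talk:

```python
import math

# Toy Markov tree over three binary variables; edges (0,1) and (1,2).
# All tables are illustrative numbers, chosen so that each edge table's
# marginals match the corresponding node tables.
theta_node = {
    0: [0.8, 0.2],          # P(X0)
    1: [0.8, 0.2],          # P(X1)
    2: [0.7, 0.3],          # P(X2)
}
theta_edge = {
    (0, 1): [[0.7, 0.1],    # joint table P(X0, X1)
             [0.1, 0.1]],
    (1, 2): [[0.6, 0.2],    # joint table P(X1, X2)
             [0.1, 0.1]],
}

def forest_prob(x, nodes, edges):
    """Closed-form factorization of a Markov forest:
    P(x) = prod_i theta_i(x_i) * prod_(i,j) theta_ij(x_i,x_j) / (theta_i(x_i) theta_j(x_j))."""
    p = 1.0
    for i, tbl in nodes.items():
        p *= tbl[x[i]]
    for (i, j), tbl in edges.items():
        p *= tbl[x[i]][x[j]] / (nodes[i][x[i]] * nodes[j][x[j]])
    return p

def log_loss(x, nodes, edges):
    """l(M, x) = -ln P_M(x)."""
    return -math.log(forest_prob(x, nodes, edges))

x = (0, 0, 0)
print(forest_prob(x, theta_node, theta_edge))
print(log_loss(x, theta_node, theta_edge))
```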
Batch Learning
A two-stage process:
1. A model M ∈ M is first extracted from a training set.
2. The average loss of the model M, (1/T) Σ_{t=1}^T l(M, x_t), is evaluated using a test set of size T.
Online Learning
A sequential process, or repeated game, between the learner and its environment. During each trial t = 1, ..., T:
1. the learner chooses a model M_t = (G_t, θ_t) ∈ M;
2. the environment responds with an outcome x_t ∈ X;
3. the learner incurs the loss l(M_t, x_t) and moves on to M_{t+1} = (G_{t+1}, θ_{t+1}).
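The repeated game above can be sketched as a generic loop. The `LaplaceBernoulli` learner and `FixedSequence` environment below are toy stand-ins for illustration, not the algorithms of the talk:

```python
import math

class LaplaceBernoulli:
    """Toy learner: predicts a Bernoulli parameter from add-one smoothed counts."""
    def __init__(self):
        self.counts = [1, 1]
    def predict(self):
        return self.counts[1] / sum(self.counts)
    def update(self, x):
        self.counts[x] += 1

class FixedSequence:
    """Toy environment replaying a fixed outcome sequence."""
    def __init__(self, xs):
        self.xs = xs
    def outcome(self, t):
        return self.xs[t]

def bernoulli_log_loss(p, x):
    return -math.log(p if x == 1 else 1.0 - p)

def online_protocol(learner, env, T, loss):
    """The repeated game: choose a model, observe an outcome, incur a loss, update."""
    cumulative = 0.0
    for t in range(T):
        model = learner.predict()      # learner chooses M_t
        x = env.outcome(t)             # environment responds with x_t
        cumulative += loss(model, x)   # learner incurs l(M_t, x_t)
        learner.update(x)              # learner moves to M_{t+1}
    return cumulative

env = FixedSequence([1, 1, 0, 1])
print(online_protocol(LaplaceBernoulli(), env, 4, bernoulli_log_loss))
```

Any concrete algorithm (e.g. FPL + Jeffreys, later in the talk) fits the same `predict`/`update` loop.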
Online vs. Batch
In batch (PAC) learning, the data (training set + test set) is assumed to be generated by a fixed but unknown probability distribution. In online learning, there is no statistical assumption about the data generated by the environment. Online learning is particularly suited to:
* adaptive environments, where the target distribution can change over time;
* streaming applications, where the data is not all available in advance;
* large-scale datasets, by processing only one outcome at a time.
Regret
Let M be a class of graphical models over an outcome space X, and let A be a learning algorithm for M. The minimax regret of A at horizon T is the maximum, over every sequence of outcomes x_{1:T} = (x_1, ..., x_T), of the cumulative relative loss between A and the best model in M, i.e.
R(A, T) = max_{x_{1:T} ∈ X^T} [ Σ_{t=1}^T l(M_t, x_t) − min_{M ∈ M} Σ_{t=1}^T l(M, x_t) ]
Learnability
A class M is (online) learnable if it admits an online learning algorithm A such that:
1. the minimax regret of A is sublinear in T, i.e. lim_{T→∞} R(A, T)/T = 0;
2. the per-round computational complexity of A is polynomial in the dimension of M.
Part 3: Online Learning of Markov Forests (regret decomposition, parametric regret, structural regret)
[Figure: a Markov forest over X_1, ..., X_15.]
Does there exist an efficient online learning algorithm for Markov forests? How can we efficiently update, at each iteration, both the structure F_t and the parameters θ_t in order to minimize the cumulative loss Σ_t l(F_t, θ_t, x_t)?
Two Key Properties
For the class F_{m,n} = F_n × Θ_{m,n} of Markov forests:
1. The probability distribution associated with a Markov forest M = (F, θ) factorizes in closed form:
P_M(x) = Π_{i=1}^n θ_i(x_i) · Π_{(i,j)∈F} θ_ij(x_i, x_j) / (θ_i(x_i) θ_j(x_j))
2. The space F_n of forest structures is a matroid; minimizing a linear function over F_n can be done in quadratic time using the matroid greedy algorithm.
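The second property can be illustrated with a Kruskal-style sketch of the matroid greedy algorithm: scan candidate edges by increasing weight and keep an edge while it has negative weight and preserves acyclicity (the independence test of the graphic matroid). The function name and union-find layout are illustrative choices:

```python
def min_weight_forest(n, weights):
    """Matroid greedy sketch for minimizing a linear function <f, w> over forests.
    weights: dict mapping each candidate edge (i, j) to a real weight."""
    parent = list(range(n))
    def find(i):
        # Union-find root with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    forest = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: kv[1]):
        if w >= 0:          # a non-negative edge cannot lower the cost
            break
        ri, rj = find(i), find(j)
        if ri != rj:        # acyclicity = independence in the graphic matroid
            parent[ri] = rj
            forest.append((i, j))
    return forest

# Usage: edges with negative weight are added greedily unless they close a cycle.
print(min_weight_forest(4, {(0, 1): -2.0, (1, 2): -1.0, (0, 2): -3.0, (2, 3): 0.5}))
```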
Log-Loss
Let M = (f, θ) be a Markov forest, where f is the characteristic vector of the structure:
l(M, x) = −ln P_M(x) = −ln [ Π_{i=1}^n θ_i(x_i) · Π_{(i,j)} ( θ_ij(x_i, x_j) / (θ_i(x_i) θ_j(x_j)) )^{f_ij} ]
So the log-loss is an affine function of the forest structure:
l(M, x) = ψ(x) + ⟨f, φ(x)⟩
where ψ(x) = Σ_{i∈[n]} ln( 1 / θ_i(x_i) ) and φ_ij(x_i, x_j) = ln( θ_i(x_i) θ_j(x_j) / θ_ij(x_i, x_j) )
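A quick numerical check of the affine form, using a toy two-edge forest with consistent marginals (illustrative numbers, not from the talk):

```python
import math

# Toy tables: nodes 0, 1, 2 with edges (0,1) and (1,2); marginals are consistent.
theta = {0: [0.8, 0.2], 1: [0.8, 0.2], 2: [0.7, 0.3]}
theta_e = {(0, 1): [[0.7, 0.1], [0.1, 0.1]],
           (1, 2): [[0.6, 0.2], [0.1, 0.1]]}
F = [(0, 1), (1, 2)]          # edges whose entries of f are set to 1
x = (0, 0, 0)

# Affine form: l(M, x) = psi(x) + <f, phi(x)>.
psi = sum(math.log(1.0 / theta[i][x[i]]) for i in theta)
phi = {(i, j): math.log(theta[i][x[i]] * theta[j][x[j]] / theta_e[i, j][x[i]][x[j]])
       for (i, j) in theta_e}
affine = psi + sum(phi[e] for e in F)

# Direct log-loss from the factorized distribution.
p = 1.0
for i in theta:
    p *= theta[i][x[i]]
for (i, j) in F:
    p *= theta_e[i, j][x[i]][x[j]] / (theta[i][x[i]] * theta[j][x[j]])
direct = -math.log(p)

print(abs(affine - direct) < 1e-12)   # the two expressions agree
```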
Regret Decomposition
Based on the linearity of the log-loss, the regret decomposes into two parts:
R(M_{1:T}, x_{1:T}) = R(f_{1:T}, x_{1:T}) + R(θ_{1:T}, x_{1:T})
where
R(f_{1:T}, x_{1:T}) = Σ_{t=1}^T [ l(f_t, θ_t, x_t) − l(f*, θ_t, x_t) ]  (structural regret)
R(θ_{1:T}, x_{1:T}) = Σ_{t=1}^T [ l(f*, θ_t, x_t) − l(f*, θ*, x_t) ]  (parametric regret)
Parametric Regret
Based on the closed-form expression of Markov forests, the parametric regret decomposes into local regrets:
R(θ_{1:T}, x_{1:T}) = Σ_{i=1}^n ln( θ*_i(x_i^{1:T}) / θ_i^{1:T}(x_i^{1:T}) )  (univariate estimators)
 + Σ_{(i,j)∈F*} ln( θ*_ij(x_ij^{1:T}) / θ_ij^{1:T}(x_ij^{1:T}) )  (bivariate estimators)
 + Σ_{(i,j)∈F*} ln( θ_i^{1:T}(x_i^{1:T}) θ_j^{1:T}(x_j^{1:T}) / ( θ*_i(x_i^{1:T}) θ*_j(x_j^{1:T}) ) )  (bivariate compensation)
where
θ*_i(x_i^{1:T}) = Π_{t=1}^T θ*_i(x_i^t),  θ*_ij(x_ij^{1:T}) = Π_{t=1}^T θ*_ij(x_i^t, x_j^t),
θ_i^{1:T}(x_i^{1:T}) = Π_{t=1}^T θ_i^t(x_i^t),  θ_ij^{1:T}(x_ij^{1:T}) = Π_{t=1}^T θ_ij^t(x_i^t, x_j^t).
Local Regrets
Expressions of the form θ*(x^{1:T}) / θ^{1:T}(x^{1:T}) have been extensively studied in the literature on universal coding (Grünwald, 2007). The key idea: use Dirichlet mixtures.
Dirichlet Mixtures
Using symmetric Dirichlet mixtures for the parameter estimators,
θ^{1:T}(x^{1:T}) = ∫ Π_{t=1}^T P_λ(x^t) p_μ(λ) dλ = ( Γ(mμ) / Γ(μ)^m ) · Π_{v=1}^m Γ(t_v + μ) / Γ(T + mμ)
where t_v is the number of occurrences of value v in x^{1:T}. If μ = 1/2, we get the Jeffreys mixture.
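The closed form above can be checked against the chained Dirichlet predictive rule P(x^{t+1} = v | x^{1:t}) = (t_v + μ) / (t + mμ), which produces the same sequence probability. The function names are illustrative:

```python
import math

def dirichlet_mixture_prob(seq, m, mu=0.5):
    """Closed-form marginal likelihood of a sequence under a symmetric
    Dirichlet(mu) mixture over multinomial parameters; mu = 1/2 gives
    the Jeffreys mixture."""
    t = len(seq)
    counts = [seq.count(v) for v in range(m)]
    log_p = (math.lgamma(m * mu) - m * math.lgamma(mu)
             + sum(math.lgamma(c + mu) for c in counts)
             - math.lgamma(t + m * mu))
    return math.exp(log_p)

def sequential_prob(seq, m, mu=0.5):
    """Same quantity, computed by chaining the predictive rule
    P(x^{t+1} = v | x^{1:t}) = (t_v + mu) / (t + m*mu)."""
    counts = [0] * m
    p = 1.0
    for t, v in enumerate(seq):
        p *= (counts[v] + mu) / (t + m * mu)
        counts[v] += 1
    return p

seq = [0, 1, 0, 0, 1]
print(dirichlet_mixture_prob(seq, m=2), sequential_prob(seq, m=2))
```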
Jeffreys Strategy (an extension of Xie and Barron (2000) to forest parameters)
For each trial t:
1. Set θ_i^{t+1}(u) = (t_u + 1/2) / (t + m/2) for all i ∈ [n], u ∈ [m]
2. Set θ_ij^{t+1}(u, v) = (t_uv + 1/2) / (t + m²/2) for all pairs (i,j), u, v ∈ [m]
Performance
Minimax regret: ( n(m−1) + (n−1)(m−1)² ) / 2 · ln( T / 2π ) + C_{m,n} + o(m²n)
Per-round time complexity: O(m²n²)
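The Jeffreys strategy amounts to add-1/2 smoothing of the empirical node and pair counts; a batch-style sketch (the function name and interface are illustrative):

```python
def jeffreys_tables(data, n, m):
    """Jeffreys (add-1/2) node and edge tables after observing `data`,
    a list of outcomes, each an n-tuple over [m] = {0, ..., m-1}."""
    t = len(data)
    # theta_i(u) = (t_u + 1/2) / (t + m/2)
    node = [[(sum(1 for x in data if x[i] == u) + 0.5) / (t + m / 2)
             for u in range(m)] for i in range(n)]
    # theta_ij(u, v) = (t_uv + 1/2) / (t + m^2/2)
    edge = {}
    for i in range(n):
        for j in range(i + 1, n):
            edge[i, j] = [[(sum(1 for x in data if (x[i], x[j]) == (u, v)) + 0.5)
                           / (t + m * m / 2)
                           for v in range(m)] for u in range(m)]
    return node, edge

# Usage: each resulting table sums to 1 by construction.
node, edge = jeffreys_tables([(0, 1), (0, 0), (1, 1)], n=2, m=2)
print(node[0], edge[0, 1])
```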
Structural Regret
Based on the linear form of the log-loss, the structure learning problem can be cast as a sequential combinatorial optimization problem (Audibert et al., 2011):
Minimize Σ_{t=1}^T ⟨f_t, φ(x_t)⟩ subject to f_1, ..., f_T ∈ F_n
Key tool: use the greedy algorithm.
Follow the Perturbed Leader (Kalai and Vempala, 2005)
For each trial t:
1. Draw r_t in [0, 1/α_t] uniformly at random
2. Set f_{t+1} = argmin_{f ∈ F_n} ⟨f, L_t + r_t⟩, where L_t = Σ_{s=1}^t φ(x_s)
Performance
Minimax regret: n² ln(T/2 + m²/4) √(2T)
Per-round time complexity: O(n² log n)
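A minimal sketch of the FPL update: perturb the cumulative edge losses, then run the matroid greedy algorithm on the perturbed weights. A single scalar range `epsilon` stands in for the interval [0, 1/α_t]; names and interface are illustrative:

```python
import random

def min_weight_forest(n, weights):
    """Kruskal-style matroid greedy: keep negative-weight edges, cheapest
    first, while the selection stays acyclic."""
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    forest = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: kv[1]):
        if w >= 0:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            forest.append((i, j))
    return forest

def fpl_step(n, cumulative_loss, epsilon, rng=random):
    """One FPL update. cumulative_loss: dict edge -> L_t[e] = sum_s phi_e(x_s);
    epsilon plays the role of 1/alpha_t in the slide above."""
    perturbed = {e: L + rng.uniform(0.0, epsilon)
                 for e, L in cumulative_loss.items()}
    return min_weight_forest(n, perturbed)

# With epsilon = 0 the update degenerates to follow-the-leader.
print(fpl_step(4, {(0, 1): -2.0, (1, 2): -1.0, (0, 2): -3.0}, epsilon=0.0))
```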
Experiments
[Figure: average log-loss vs. iterations for CL, CLT, OFDE-F, OFDE-T on the KDD Cup 2000 and MSNBC datasets.]
The average log-loss is measured on the test set at each iteration. The online learner (FPL + Jeffreys) rapidly converges to the batch learner (Chow-Liu for Markov trees, and thresholded Chow-Liu for Markov forests).
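The batch baseline can be sketched as well: thresholded Chow-Liu weights each candidate edge by empirical mutual information and keeps a maximum-weight forest, dropping edges whose mutual information falls below a threshold. The function name and threshold convention are illustrative:

```python
import math
from collections import Counter

def chow_liu_forest(data, n, threshold=0.0):
    """Thresholded Chow-Liu sketch: empirical mutual information weights,
    then a greedy maximum-weight forest over edges above `threshold`."""
    t = float(len(data))
    mi = {}
    for i in range(n):
        for j in range(i + 1, n):
            pij = Counter((x[i], x[j]) for x in data)
            pi = Counter(x[i] for x in data)
            pj = Counter(x[j] for x in data)
            # I(X_i; X_j) = sum_uv p(u,v) ln( p(u,v) / (p(u) p(v)) )
            mi[i, j] = sum((c / t) * math.log((c / t) / ((pi[u] / t) * (pj[v] / t)))
                           for (u, v), c in pij.items())
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    forest = []
    for (i, j), w in sorted(mi.items(), key=lambda kv: -kv[1]):
        if w <= threshold:      # thresholding yields a forest, not a full tree
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            forest.append((i, j))
    return forest

# Usage: X0 and X1 are perfectly correlated, X2 is independent noise.
data = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
print(chow_liu_forest(data, 3, threshold=1e-6))
```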
More informationOnline Kernel PCA with Entropic Matrix Updates
Online Kernel PCA with Entropic Matrix Updates Dima Kuzmin Manfred K. Warmuth University of California - Santa Cruz ICML 2007, Corvallis, Oregon April 23, 2008 D. Kuzmin, M. Warmuth (UCSC) Online Kernel
More information27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling
10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel
More informationExpectation Propagation Algorithm
Expectation Propagation Algorithm 1 Shuang Wang School of Electrical and Computer Engineering University of Oklahoma, Tulsa, OK, 74135 Email: {shuangwang}@ou.edu This note contains three parts. First,
More information{ p if x = 1 1 p if x = 0
Discrete random variables Probability mass function Given a discrete random variable X taking values in X = {v 1,..., v m }, its probability mass function P : X [0, 1] is defined as: P (v i ) = Pr[X =
More informationMachine Learning Summer School
Machine Learning Summer School Lecture 3: Learning parameters and structure Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Department of Engineering University of Cambridge,
More informationGraphical Models and Kernel Methods
Graphical Models and Kernel Methods Jerry Zhu Department of Computer Sciences University of Wisconsin Madison, USA MLSS June 17, 2014 1 / 123 Outline Graphical Models Probabilistic Inference Directed vs.
More informationLearning P-maps Param. Learning
Readings: K&F: 3.3, 3.4, 16.1, 16.2, 16.3, 16.4 Learning P-maps Param. Learning Graphical Models 10708 Carlos Guestrin Carnegie Mellon University September 24 th, 2008 10-708 Carlos Guestrin 2006-2008
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationPROBABILISTIC REASONING SYSTEMS
PROBABILISTIC REASONING SYSTEMS In which we explain how to build reasoning systems that use network models to reason with uncertainty according to the laws of probability theory. Outline Knowledge in uncertain
More informationNotes on MapReduce Algorithms
Notes on MapReduce Algorithms Barna Saha 1 Finding Minimum Spanning Tree of a Dense Graph in MapReduce We are given a graph G = (V, E) on V = N vertices and E = m N 1+c edges for some constant c > 0. Our
More informationCollapsed Variational Inference for Sum-Product Networks
for Sum-Product Networks Han Zhao 1, Tameem Adel 2, Geoff Gordon 1, Brandon Amos 1 Presented by: Han Zhao Carnegie Mellon University 1, University of Amsterdam 2 June. 20th, 2016 1 / 26 Outline Background
More informationAn Introduction to Bayesian Machine Learning
1 An Introduction to Bayesian Machine Learning José Miguel Hernández-Lobato Department of Engineering, Cambridge University April 8, 2013 2 What is Machine Learning? The design of computational systems
More informationThe geometry of Gaussian processes and Bayesian optimization. Contal CMLA, ENS Cachan
The geometry of Gaussian processes and Bayesian optimization. Contal CMLA, ENS Cachan Background: Global Optimization and Gaussian Processes The Geometry of Gaussian Processes and the Chaining Trick Algorithm
More informationBayesian Networks Inference with Probabilistic Graphical Models
4190.408 2016-Spring Bayesian Networks Inference with Probabilistic Graphical Models Byoung-Tak Zhang intelligence Lab Seoul National University 4190.408 Artificial (2016-Spring) 1 Machine Learning? Learning
More informationDirected Graphical Models or Bayesian Networks
Directed Graphical Models or Bayesian Networks Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Bayesian Networks One of the most exciting recent advancements in statistical AI Compact
More information1 Bayesian Linear Regression (BLR)
Statistical Techniques in Robotics (STR, S15) Lecture#10 (Wednesday, February 11) Lecturer: Byron Boots Gaussian Properties, Bayesian Linear Regression 1 Bayesian Linear Regression (BLR) In linear regression,
More informationIntroduction to continuous and hybrid. Bayesian networks
Introduction to continuous and hybrid Bayesian networks Joanna Ficek Supervisor: Paul Fink, M.Sc. Department of Statistics LMU January 16, 2016 Outline Introduction Gaussians Hybrid BNs Continuous children
More informationIntroduction to Graphical Models
Introduction to Graphical Models The 15 th Winter School of Statistical Physics POSCO International Center & POSTECH, Pohang 2018. 1. 9 (Tue.) Yung-Kyun Noh GENERALIZATION FOR PREDICTION 2 Probabilistic
More informationIntroduction to Probability and Statistics (Continued)
Introduction to Probability and Statistics (Continued) Prof. icholas Zabaras Center for Informatics and Computational Science https://cics.nd.edu/ University of otre Dame otre Dame, Indiana, USA Email:
More informationOnline Algorithms for Sum-Product
Online Algorithms for Sum-Product Networks with Continuous Variables Priyank Jaini Ph.D. Seminar Consistent/Robust Tensor Decomposition And Spectral Learning Offline Bayesian Learning ADF, EP, SGD, oem
More informationCS 2750: Machine Learning. Bayesian Networks. Prof. Adriana Kovashka University of Pittsburgh March 14, 2016
CS 2750: Machine Learning Bayesian Networks Prof. Adriana Kovashka University of Pittsburgh March 14, 2016 Plan for today and next week Today and next time: Bayesian networks (Bishop Sec. 8.1) Conditional
More informationReview: Bayesian learning and inference
Review: Bayesian learning and inference Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E Inference problem:
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft
More informationLearning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley
Learning Methods for Online Prediction Problems Peter Bartlett Statistics and EECS UC Berkeley Course Synopsis A finite comparison class: A = {1,..., m}. Converting online to batch. Online convex optimization.
More informationDynamic Matrix-Variate Graphical Models A Synopsis 1
Proc. Valencia / ISBA 8th World Meeting on Bayesian Statistics Benidorm (Alicante, Spain), June 1st 6th, 2006 Dynamic Matrix-Variate Graphical Models A Synopsis 1 Carlos M. Carvalho & Mike West ISDS, Duke
More informationIntelligent Systems:
Intelligent Systems: Undirected Graphical models (Factor Graphs) (2 lectures) Carsten Rother 15/01/2015 Intelligent Systems: Probabilistic Inference in DGM and UGM Roadmap for next two lectures Definition
More informationLecture 9: PGM Learning
13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and
More information10708 Graphical Models: Homework 2
10708 Graphical Models: Homework 2 Due Monday, March 18, beginning of class Feburary 27, 2013 Instructions: There are five questions (one for extra credit) on this assignment. There is a problem involves
More informationAchievability of Asymptotic Minimax Regret by Horizon-Dependent and Horizon-Independent Strategies
Journal of Machine Learning Research 16 015) 1-48 Submitted 7/14; Revised 1/14; Published 8/15 Achievability of Asymptotic Minimax Regret by Horizon-Dependent and Horizon-Independent Strategies Kazuho
More informationScoring functions for learning Bayesian networks
Tópicos Avançados p. 1/48 Scoring functions for learning Bayesian networks Alexandra M. Carvalho Tópicos Avançados p. 2/48 Plan Learning Bayesian networks Scoring functions for learning Bayesian networks:
More informationDS-GA 1002 Lecture notes 11 Fall Bayesian statistics
DS-GA 100 Lecture notes 11 Fall 016 Bayesian statistics In the frequentist paradigm we model the data as realizations from a distribution that depends on deterministic parameters. In contrast, in Bayesian
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Uncertainty & Probabilities & Bandits Daniel Hennes 16.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Uncertainty Probability
More informationDeep Poisson Factorization Machines: a factor analysis model for mapping behaviors in journalist ecosystem
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
More informationStochastic prediction of train delays with dynamic Bayesian networks. Author(s): Kecman, Pavle; Corman, Francesco; Peterson, Anders; Joborn, Martin
Research Collection Other Conference Item Stochastic prediction of train delays with dynamic Bayesian networks Author(s): Kecman, Pavle; Corman, Francesco; Peterson, Anders; Joborn, Martin Publication
More informationVers un apprentissage subquadratique pour les mélanges d arbres
Vers un apprentissage subquadratique pour les mélanges d arbres F. Schnitzler 1 P. Leray 2 L. Wehenkel 1 fschnitzler@ulg.ac.be 1 Université deliège 2 Université de Nantes 10 mai 2010 F. Schnitzler (ULG)
More informationCS599: Convex and Combinatorial Optimization Fall 2013 Lecture 17: Combinatorial Problems as Linear Programs III. Instructor: Shaddin Dughmi
CS599: Convex and Combinatorial Optimization Fall 2013 Lecture 17: Combinatorial Problems as Linear Programs III Instructor: Shaddin Dughmi Announcements Today: Spanning Trees and Flows Flexibility awarded
More informationSTA414/2104. Lecture 11: Gaussian Processes. Department of Statistics
STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations
More informationOnline Learning Summer School Copenhagen 2015 Lecture 1
Online Learning Summer School Copenhagen 2015 Lecture 1 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Online Learning Shai Shalev-Shwartz (Hebrew U) OLSS Lecture
More information