Learning and message-passing in graphical models

Martin Wainwright, UC Berkeley, Departments of Statistics and EECS

Introduction
A graphical model combines a graph $G = (V, E)$ with $p$ vertices and a random vector $(X_1, X_2, \dots, X_p)$.
Example graph structures: (a) Markov chain, (b) multiscale quadtree, (c) two-dimensional grid.
Useful in many statistical and computational fields: statistical signal processing, statistical image processing, statistical machine learning, computational biology and bioinformatics, statistical physics, communication and information theory.

Graphs and factorization
A clique $C$ is a fully connected subset of vertices, and a compatibility function $\psi_C$ is defined on the variables $x_C = \{x_s,\; s \in C\}$. The distribution factorizes over all cliques:
$$Q(x_1, \dots, x_p) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C).$$
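As a concrete illustration of this factorization, the sketch below builds the joint distribution of a small three-variable chain by multiplying pairwise compatibility functions and normalizing; the particular potentials are made up for the example and are not taken from the talk.

```python
import numpy as np

# Hypothetical pairwise potentials for a 3-node chain 1 - 2 - 3,
# each variable taking 2 states.
psi_12 = np.array([[2.0, 1.0], [1.0, 2.0]])  # psi_{12}(x_1, x_2)
psi_23 = np.array([[2.0, 1.0], [1.0, 2.0]])  # psi_{23}(x_2, x_3)

# Unnormalized joint: product of the clique compatibility functions.
joint = np.einsum('ij,jk->ijk', psi_12, psi_23)

# Partition function Z and normalized distribution Q.
Z = joint.sum()
Q = joint / Z
print(Q.sum())  # 1.0
```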

Example: Optical digit/character recognition
Goal: correctly label digits/characters based on noisy versions, e.g., mail sorting, document scanning, handwriting recognition systems.
Strong sequential dependencies are captured by a (hidden) Markov chain; message-passing spreads information along the chain (Baum & Petrie, 1966; Viterbi, 1967, and many others).

Example: Image denoising with grid models
An 8-bit digital image is a matrix of intensity values in $\{0, 1, \dots, 255\}$. A pairwise interaction of the form
$$\psi_{uv}(x_u, x_v) = \begin{pmatrix} 1 & \gamma & \gamma \\ \gamma & 1 & \gamma \\ \gamma & \gamma & 1 \end{pmatrix}, \qquad \gamma \in (0,1)$$
(shown here for three intensity levels) enforces smoothness: $\{x_u = x_v\}$ is more likely than $\{x_u \neq x_v\}$.
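A small sketch of how such a smoothing potential might be constructed in code; the number of intensity levels and the value of gamma below are illustrative choices, not values from the talk.

```python
import numpy as np

def smoothing_potential(num_levels: int, gamma: float) -> np.ndarray:
    """Pairwise potential with 1 on the diagonal (x_u == x_v) and
    gamma < 1 off the diagonal, favouring equal neighbouring labels."""
    psi = np.full((num_levels, num_levels), gamma)
    np.fill_diagonal(psi, 1.0)
    return psi

psi_uv = smoothing_potential(num_levels=256, gamma=0.4)  # 8-bit intensities
```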

Example: Multi-scale modeling
It is often useful to move beyond pixel-based representations in image modeling: compute a multi-scale transform (e.g., wavelets, Gabor filters, etc.) and model the coefficients using a multi-scale autoregressive process on a tree (e.g., Willsky, 2002).

Example: Depth estimation in computer vision
Stereo pairs: two images taken from horizontally offset cameras.

Modeling depth with a graphical model
Introduce a variable $x_{ab}$ at each pixel location $(a, b)$, representing the offset (disparity) between the left and right images at that position; the model combines node potentials $\psi_{ab}(x_{ab})$ with pairwise potentials $\psi_{ab,cd}(x_{ab}, x_{cd})$ between neighbouring pixels.
Use message-passing algorithms to estimate the most likely offset/depth map (Szeliski et al., 2005).

Many other examples
natural language processing (e.g., parsing, translation)
computational biology (gene sequences, protein folding, phylogenetic reconstruction)
sensor network deployments (e.g., distributed detection, estimation, fault monitoring)
social network analysis (e.g., politics, Facebook, terrorism)
communication theory and error-control decoding (e.g., turbo codes, LDPC codes)
satisfiability problems (3-SAT, MAX-XORSAT, graph colouring)
robotics (path planning, tracking, navigation)
...

Remainder of talk
1. Message-passing in graphical models: marginalization and applications; basic form of the sum-product algorithm (belief propagation); stochastic belief propagation.
2. Learning graph structure from data: applications; neighborhood-based regression; application to social network data.

1. Marginalization and message-passing
Define the marginal distribution at vertex $u \in V$:
$$Q(x_u \mid y) = \sum_{x_v,\, v \in V \setminus \{u\}} Q(x_1, x_2, \dots, x_N \mid y).$$
Useful in many applications:
Optical character recognition.
Computer vision: disparity computation $Q(x_u \mid y_1, \dots, y_N)$, where $x_u$ is the depth at pixel $u$ and $y_1, \dots, y_N$ are multiple images.

Divide and conquer
Goal: compute the marginal distribution at node 1 of a three-node chain $1 - 2 - 3$:
$$\sum_{x_2, x_3} Q(x_1, x_2, x_3) \propto \sum_{x_2, x_3} \Big\{ \psi_1(x_1)\,\psi_2(x_2)\,\psi_3(x_3)\,\psi_{12}(x_1, x_2)\,\psi_{23}(x_2, x_3) \Big\}.$$
Pushing the sums inside the product gives
$$\sum_{x_2, x_3} Q(x_1, x_2, x_3) \propto \psi_1(x_1) \underbrace{\sum_{x_2} \Big\{ \psi_2(x_2)\,\psi_{12}(x_1, x_2) \underbrace{\sum_{x_3} \psi_3(x_3)\,\psi_{23}(x_2, x_3)}_{M_{32}(x_2)} \Big\}}_{M_{21}(x_1)}.$$
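The sketch below carries out this elimination numerically for a three-node chain with binary variables; the potentials are illustrative placeholders rather than values from the talk.

```python
import numpy as np

d = 2
psi1, psi2, psi3 = np.ones(d), np.ones(d), np.ones(d)   # node potentials
psi12 = np.array([[2.0, 1.0], [1.0, 2.0]])              # psi_{12}(x1, x2)
psi23 = np.array([[2.0, 1.0], [1.0, 2.0]])              # psi_{23}(x2, x3)

# Eliminate x3 first: M_{32}(x2) = sum_{x3} psi3(x3) psi23(x2, x3)
M32 = psi23 @ psi3

# Then eliminate x2: M_{21}(x1) = sum_{x2} psi2(x2) psi12(x1, x2) M_{32}(x2)
M21 = psi12 @ (psi2 * M32)

# Marginal at node 1, up to normalization.
marginal1 = psi1 * M21
marginal1 /= marginal1.sum()
print(marginal1)
```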

Iterating the procedure
Decompose the marginalization over a larger tree (here a five-node tree, with messages $M_{21}$, $M_{32}$, $M_{42}$, $M_{53}$):
$$\sum_{x_2, x_3, x_4, x_5} Q(x) \propto \psi_1(x_1) \sum_{x_2} \Big[ \psi_{12}(x_1, x_2) \prod_{u \in N(2) \setminus \{1\}} M_{u2}(x_2) \Big].$$
Update messages along each edge, for example
$$M_{32}(x_2) = \sum_{x_3} \psi_3(x_3)\,\psi_{23}(x_2, x_3) \prod_{v \in N(3) \setminus \{2\}} M_{v3}(x_3).$$

Ordinary belief propagation
1. At time $t = 0$, for all edges $(v, u) \in E$, initialize the message vectors: $M^0_{vu}(k) = 1$ for all states $k \in \{1, \dots, d\}$.
2. At times $t = 0, 1, 2, \dots$, for each directed edge $(v \to u)$, update the message:
$$M^{t+1}_{vu}(k) \;\propto\; \sum_{j=1}^{d} \psi_v(j)\,\psi_{uv}(k, j) \prod_{w \in N(v) \setminus \{u\}} M^t_{wv}(j).$$
3. Upon convergence, compute (approximate) marginal distributions:
$$Q[X_u = k] \;\propto\; \psi_u(k) \prod_{w \in N(u)} M_{wu}(k).$$
Computational complexity: $\Theta(d^2)$ operations per edge per round.
Communication complexity: requires passing $d - 1$ real numbers per edge.
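A minimal sketch of these sum-product updates for a pairwise model on an undirected graph; the parallel update schedule, per-round normalization of messages, and fixed iteration count are implementation choices for the illustration, not details specified on the slide.

```python
import numpy as np

def belief_propagation(edges, node_pot, edge_pot, d, num_iters=50):
    """Parallel sum-product updates on a pairwise model.

    edges:    list of undirected edges (u, v)
    node_pot: dict u -> psi_u, array of shape (d,)
    edge_pot: dict (u, v) -> psi_uv, array of shape (d, d), indexed [x_u, x_v]
    """
    nbrs = {u: set() for u in node_pot}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)

    def pot(u, v):  # psi_uv viewed with index order [x_u, x_v]
        return edge_pot[(u, v)] if (u, v) in edge_pot else edge_pot[(v, u)].T

    # msgs[(v, u)] is the message from v to u, initialized to all ones
    msgs = {(v, u): np.ones(d) for u in nbrs for v in nbrs[u]}

    for _ in range(num_iters):
        new = {}
        for (v, u) in msgs:
            incoming = np.ones(d)
            for w in nbrs[v] - {u}:          # product over N(v) \ {u}
                incoming *= msgs[(w, v)]
            m = pot(u, v) @ (node_pot[v] * incoming)   # sum over states of v
            new[(v, u)] = m / m.sum()                  # normalize for stability
        msgs = new

    marginals = {}
    for u in nbrs:
        b = node_pot[u].copy()
        for w in nbrs[u]:
            b *= msgs[(w, u)]
        marginals[u] = b / b.sum()
    return marginals
```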

Some reduced-complexity methods
Faster methods for problems with algebraic structure: low-density parity-check codes (e.g., Gallager, 1963; Richardson & Urbanke, 2008); Toeplitz matrices in computer vision (Felzenszwalb & Huttenlocher, 2006); exploiting symmetries (Kersten et al., 2009; McAuley & Caetano, 2011).
Quantized forms of belief propagation (Gallager, 1963; Coughlan & Shen, 2007): lower complexity (computation and communication), but do not converge to the exact BP fixed point.
Particle filtering and non-parametric belief propagation (Doucet et al., 2001; Sudderth et al., 2003): designed for continuous state spaces; high fidelity requires taking the number of particles to infinity.

Stochastic belief propagation
Off-line phase: pre-compute the normalized columns
$$\Gamma_{uv}(:, j) = \frac{\psi_{uv}(:, j)\,\psi_v(j)}{\beta_{uv}(j)}, \qquad \text{where } \beta_{uv}(j) = \sum_{k=1}^{d} \psi_{uv}(k, j)\,\psi_v(j).$$
Stochastic belief propagation (SBP):
1. Initialize the message vectors $M^0_{vu}(k) = 1$ for all $k \in \{1, \dots, d\}$.
2. At times $t = 0, 1, 2, \dots$, for each directed edge $(v \to u)$:
(A) Compute the product of incoming messages: $M_{v \setminus u}(j) := \prod_{w \in N(v) \setminus \{u\}} M^t_{wv}(j)$.
(B) Pick a random index $J^t_{vu}$ according to the probability distribution $p^t_{vu}(j) \propto M_{v \setminus u}(j)\,\beta_{uv}(j)$ for $j \in \{1, \dots, d\}$.
(C) Update the message vector $M^{t+1}_{vu}(:) \in \mathbb{R}^d$ with stepsize $\lambda^t \in (0, 1)$:
$$M^{t+1}_{vu}(:) = (1 - \lambda^t)\,M^t_{vu}(:) + \lambda^t\,\Gamma_{uv}(:, J^t_{vu}).$$
Computational complexity: $\Theta(d)$ operations per edge per round.
Communication complexity: requires passing $\log(d)$ bits per edge.
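A sketch of the SBP update for a single directed edge, assuming the Gamma columns and beta weights have been pre-computed as above; the function name, random seed, and stepsize schedule are illustrative choices rather than details from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def sbp_edge_update(M_in_product, M_vu, Gamma_uv, beta_uv, step):
    """One stochastic BP update for the directed edge (v -> u).

    M_in_product: product of incoming messages M^t_{wv}, w in N(v)\{u}, shape (d,)
    M_vu:         current message M^t_{vu}, shape (d,)
    Gamma_uv:     pre-computed normalized columns, shape (d, d)
    beta_uv:      pre-computed column weights, shape (d,)
    step:         stepsize lambda_t in (0, 1)
    """
    # (B) sample a column index J with probability prop. to M_{v\u}(j) * beta_uv(j)
    p = M_in_product * beta_uv
    p /= p.sum()
    J = rng.choice(len(p), p=p)
    # (C) convex combination of the old message and the sampled column
    return (1.0 - step) * M_vu + step * Gamma_uv[:, J]

# A Robbins-Monro style schedule (sum of lambda_t diverges, sum of squares converges),
# e.g. lambda_t = 1/(t+2); this particular choice is illustrative.
steps = [1.0 / (t + 2) for t in range(10)]
```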

Sample paths on a grid-structured graph: normalized squared error versus number of iterations.

Some intuition
The stochastic BP update is unbiased at each round:
$$\mathbb{E}\big[M^{t+1}_{vu} \mid M^t_{vu}\big] = (1 - \lambda^t)\,M^t_{vu}(:) + \lambda^t \sum_{j=1}^{d} p_{uv}(j)\,\Gamma_{uv}(:, j) \;\propto\; (1 - \lambda^t)\,M^t_{vu}(:) + \lambda^t \sum_{j=1}^{d} \psi_{uv}(:, j)\,\psi_v(j) \prod_{w \in N(v) \setminus \{u\}} M^t_{wv}(j).$$
The variance depends on the step sizes $\{\lambda^t\}_{t \ge 1}$: we need $\lambda^t \to 0$ to smooth out fluctuations, and $\sum_{t=1}^{\infty} \lambda^t = +\infty$ for infinite travel.
Challenges:
1. A stability condition is needed, so that the average path is well behaved.
2. Control on the spread of noise through the network.

SBP on graphs with cycles
Assume that ordinary BP is $\mu$-contractive: $\|F(M) - F(M')\|_2 \le (1 - \mu/2)\,\|M - M'\|_2$.
Theorem. With step size $\lambda^t = \frac{1}{\mu(t+1)}$, the SBP iterates have the following guarantees:
(a) The SBP messages $\{M^t\}_{t=0}^{\infty}$ converge almost surely to the unique BP fixed point $M^*$ as $t \to +\infty$.
(b) The expected squared error is bounded as
$$\frac{\mathbb{E}\big[\|M^t - M^*\|_2^2\big]}{\|M^*\|_2^2} \;\le\; \frac{c_0\,K(\psi)}{\mu^2\,t} + c_1\,\frac{\|M^0 - M^*\|_2^2}{\|M^*\|_2^2}\,\Big(\frac{1}{t}\Big)^2, \tag{1}$$
where $K(\psi)$ is a constant depending on the potential functions $\psi$.
(c) For every $q \ge 1$, with probability at least $1 - \frac{2}{t^q}$, we have
$$\frac{\|M^t - M^*\|_2^2}{\|M^*\|_2^2} \;\le\; (1 + 2q)\,\frac{32\,K(\psi)}{\mu^2}\,\frac{\log t}{t^{3/4}}.$$

Complexity comparison

Algorithm       Communication per edge   Computation per edge/round   Iterations to accuracy δ
Ordinary BP     d − 1 real numbers       Θ(d²)                        log(1/δ)
Stochastic BP   log(d) bits              Θ(d)                         1/δ

Denoising example: side-by-side denoised images from ordinary BP and SBP.

Comparison of running times: mean squared error versus running time (sec) for BP and SBP.

Application to optical flow computation.

2. Learning graph structure from data
Up to now, we have been discussing a forward problem: computing marginals in a graphical model. Inverse problems concern learning the parameters and structure of graphs from data.
Many instances of such graph learning problems: fitting graphs to politicians' voting behavior; modeling diseases with epidemiological networks; traffic flow modeling; interactions between different genes; and so on.

Example: US Senate network (voting) (Banerjee et al., 2008; Ravikumar, Wainwright & Lafferty, 2010)

Example: Biological networks
Gene networks during the Drosophila life cycle (Ahmed & Xing, PNAS, 2009).
Many other examples: protein networks, phylogenetic trees.

Learning for pairwise models
Suppose we have drawn $n$ samples from
$$Q(x_1, \dots, x_p; \Theta) = \frac{1}{Z(\Theta)} \exp\Big\{ \sum_{s \in V} \theta_s x_s^2 + \sum_{(s,t) \in E} \theta_{st} x_s x_t \Big\},$$
where the graph $G$ and the matrix $[\Theta]_{st} = \theta_{st}$ of edge weights are unknown.
Data matrix: Ising model (binary variables), $X_1^n \in \{0,1\}^{n \times p}$; Gaussian model, $X_1^n \in \mathbb{R}^{n \times p}$. An estimator maps $X_1^n \mapsto \widehat{\Theta}$.
Various loss functions are possible: graph selection, i.e. is $\mathrm{supp}[\widehat{\Theta}] = \mathrm{supp}[\Theta]$?; bounds on the Kullback-Leibler divergence $D(Q_{\widehat{\Theta}} \,\|\, Q_{\Theta})$; bounds on $\|\widehat{\Theta} - \Theta\|_{\mathrm{op}}$.
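To make the sampling setup concrete, here is a minimal Gibbs-sampler sketch for drawing approximate samples from the binary (Ising) case of this model, where $\theta_s x_s^2 = \theta_s x_s$ for $x_s \in \{0,1\}$; the symmetric weight matrix, chain length, and thinning are illustrative assumptions, not part of the talk.

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_sample_ising(Theta, theta_node, num_samples, burn_in=1000, thin=10):
    """Approximate samples from Q(x; Theta) with x in {0,1}^p via single-site
    Gibbs updates; Theta is symmetric with zero diagonal."""
    p = len(theta_node)
    x = rng.integers(0, 2, size=p)
    samples = []
    total_iters = burn_in + num_samples * thin
    for it in range(total_iters):
        for s in range(p):
            # conditional log-odds of x_s = 1 given the rest (its Markov blanket)
            logit = theta_node[s] + Theta[s] @ x - Theta[s, s] * x[s]
            x[s] = rng.random() < 1.0 / (1.0 + np.exp(-logit))
        if it >= burn_in and (it - burn_in) % thin == 0:
            samples.append(x.copy())
    return np.array(samples)
```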

Challenges in graph selection
For pairwise models, the negative log-likelihood takes the form
$$\ell(\Theta; X_1^n) := -\frac{1}{n} \sum_{i=1}^{n} \log Q(x_{i1}, \dots, x_{ip}; \Theta).$$
For Gaussian graphical models, this is a log-determinant program; for discrete graphical models, exact computation of the likelihood is intractable (curse of dimensionality).
Various work-arounds are possible: Markov chain Monte Carlo and stochastic gradient methods; variational approximations to the likelihood; pseudo-likelihoods or neighborhood-based approaches.
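As one illustration of the pseudo-likelihood work-around mentioned above, the joint likelihood is replaced by a product of node-wise conditionals, each of which needs only a local normalization; the display below is the standard form (Besag, 1974) written out for reference rather than a formula taken verbatim from the slides.

```latex
% Pseudo-likelihood: each conditional Q(x_{is} | x_{i, V \ {s}}) involves only the
% Markov blanket of s, so no global partition function is needed.
\ell_{\mathrm{PL}}(\Theta; X_1^n)
  := -\frac{1}{n} \sum_{i=1}^{n} \sum_{s \in V}
     \log Q\big(x_{is} \,\big|\, x_{i, V \setminus \{s\}}; \Theta\big)
```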

Methods for graph selection
For Gaussian graphical models: $\ell_1$-regularized neighborhood regression for Gaussian MRFs (e.g., Meinshausen & Buhlmann, 2005; Wainwright, 2006; Zhao & Yu, 2006); $\ell_1$-regularized log-determinant programs (e.g., Yuan & Lin, 2006; d'Aspremont et al., 2007; Friedman, 2008; Rothman et al., 2008; Ravikumar et al., 2008).
Methods for discrete MRFs: exact solution for trees (Chow & Liu, 1967); local testing (e.g., Spirtes et al., 2000; Kalisch & Buhlmann, 2008); various other methods, including distribution fits by KL-divergence (Abeel et al., 2005), $\ell_1$-regularized logistic regression (Ravikumar, Wainwright & Lafferty, 2008, 2010), an approximate maximum-entropy approach and thinned graphical models (Johnson et al., 2007), and a neighborhood-based thresholding method (Bresler, Mossel & Sly, 2008).
Information-theoretic analysis: pseudolikelihood and the BIC criterion (Csiszar & Talata, 2006); information-theoretic limitations (Santhanam & Wainwright, 2008, 2012).

Markov property and neighborhood structure
Markov properties encode neighborhood structure:
$$\underbrace{(X_s \mid X_{V \setminus \{s\}})}_{\text{condition on full graph}} \;\overset{d}{=}\; \underbrace{(X_s \mid X_{N(s)})}_{\text{condition on Markov blanket } N(s)}$$
This is the basis of the pseudolikelihood method (Besag, 1974) and of many graph learning algorithms (Friedman et al., 1999; Csiszar & Talata, 2005; Abeel et al., 2006; Meinshausen & Buhlmann, 2006).
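For the binary pairwise model introduced earlier (where $\theta_s x_s^2 = \theta_s x_s$ for $x_s \in \{0,1\}$), this Markov property takes an explicit logistic form, which is what makes neighborhood regression natural; the display below is a standard identity spelled out for completeness rather than a formula from the slides.

```latex
% Conditional distribution of X_s given the rest, for the {0,1}-valued pairwise model:
% only the Markov blanket N(s) enters, and it has logistic-regression form.
\mathbb{Q}\big(X_s = 1 \,\big|\, x_{V \setminus \{s\}}; \Theta\big)
  = \frac{\exp\big(\theta_s + \sum_{t \in N(s)} \theta_{st} x_t\big)}
         {1 + \exp\big(\theta_s + \sum_{t \in N(s)} \theta_{st} x_t\big)}
```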

Graph selection via neighborhood regression
Predict $X_s$ based on $X_{\setminus s} := \{X_t,\; t \neq s\}$.
1. For each node $s \in V$, compute the (regularized) maximum likelihood estimate
$$\widehat{\theta}[s] := \arg\min_{\theta \in \mathbb{R}^{p-1}} \Big\{ \underbrace{\frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(\theta; X_{i, \setminus s})}_{\text{local log. likelihood}} + \underbrace{\lambda_n \|\theta\|_1}_{\text{regularization}} \Big\}.$$
2. Estimate the local neighborhood $\widehat{N}(s)$ as the support of the regression vector $\widehat{\theta}[s] \in \mathbb{R}^{p-1}$.
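A sketch of this neighborhood-regression recipe for binary data using an off-the-shelf l1-penalized logistic regression; the data matrix, regularization level, and support threshold are illustrative assumptions, not the tuned choices from the analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_neighborhoods(X, lam):
    """X: n-by-p binary data matrix; lam: l1 regularization weight lambda_n.
    Returns a dict mapping each node s to its estimated neighborhood N_hat(s)."""
    n, p = X.shape
    neighborhoods = {}
    for s in range(p):
        y = X[:, s]
        X_rest = np.delete(X, s, axis=1)
        # scikit-learn parameterizes the l1 penalty through C = 1 / (n * lambda_n)
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / (n * lam))
        clf.fit(X_rest, y)
        support = np.flatnonzero(np.abs(clf.coef_.ravel()) > 1e-6)
        # map support indices back to original node labels (column s was removed)
        others = [t for t in range(p) if t != s]
        neighborhoods[s] = {others[j] for j in support}
    return neighborhoods
```

In practice the node-wise estimates are then combined across nodes, for example with an AND or OR rule on edge inclusion, to produce a single undirected graph estimate.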

High-dimensional analysis
Classical analysis: graph size $p$ fixed, sample size $n \to +\infty$.
High-dimensional analysis: allow the dimension $p$, sample size $n$, and maximum degree $d$ to increase at arbitrary rates.
Take $n$ i.i.d. samples from an MRF defined by $G_{p,d}$ and study the probability of success as a function of the three parameters:
$$\mathrm{Success}(n, p, d) = \mathbb{Q}[\text{Method recovers graph } G_{p,d} \text{ from } n \text{ samples}].$$
The theory is non-asymptotic: explicit probabilities for finite $(n, p, d)$.

Empirical behavior (unrescaled plots): probability of success versus the raw number of samples for a star graph with a linear fraction of neighbors, for p = 64, 100, 225.
Empirical behavior (appropriately rescaled): probability of success versus the rescaled control parameter for the same star graphs, p = 64, 100, 225.
Rescaled plots for 2-D lattice graphs: probability of success versus the control parameter for a 4-nearest-neighbor grid (attractive couplings), p = 64, 100, 225.

Sufficient conditions for consistent Ising selection
Consider graph sequences $G_{p,d} = (V, E)$ with $p$ vertices and maximum degree $d$, with edge weights $|\theta_{st}| \ge \theta_{\min}$ for all $(s,t) \in E$; draw $n$ i.i.d. samples and analyze the probability of success indexed by $(n, p, d)$.
Theorem (Ravikumar, Wainwright & Lafferty, 2006, 2010). Under incoherence conditions, if the rescaled sample size satisfies
$$\gamma_{\mathrm{LR}}(n, p, d) := \frac{n}{d^3 \log p} > \gamma_{\mathrm{crit}}$$
and the regularization parameter is chosen as $\lambda_n \ge c_1 \sqrt{\frac{\log p}{n}}$, then with probability greater than $1 - 2\exp(-c_2 \lambda_n^2 n)$:
(a) Correct exclusion: the estimated sign neighborhood $\widehat{N}(s)$ correctly excludes all edges not in the true neighborhood.
(b) Correct inclusion: for $\theta_{\min} \ge c_3 \lambda_n$, the method selects the correct signed neighborhood.

Some related work
A thresholding estimator (polynomial-time for bounded degree) works with $n \gtrsim 2^d \log p$ samples (Bresler et al., 2008).
Information-theoretic lower bound over the family $\mathcal{G}_{p,d}$: any method requires at least $n = \Omega(d^2 \log p)$ samples (Santhanam & Wainwright, 2008, 2012).
$\ell_1$-based method: sharper achievable rates, but also failure for $\theta$ large enough to violate incoherence (Bento & Montanari, 2009).
Empirical study: the $\ell_1$-based method can succeed beyond the phase transition on the Ising model (Aurell & Ekeberg, 2011).
Simpler neighborhood-based methods: thresholding, mutual information, greedy methods (Anandkumar, Tan & Willsky, 2010a, 2010b; Netrapalli et al., 2010).

Info. theory: graph selection as channel coding
Graphical model selection is an unorthodox channel coding problem:
codewords/codebook: a graph $G$ in some graph class $\mathcal{G}$;
channel use: draw a sample $X_i = (X_{i1}, \dots, X_{ip})$ from the Markov random field $Q_{\theta(G)}$;
decoding problem: use the $n$ samples $\{X_1, \dots, X_n\}$ to correctly distinguish the codeword $G$.
$$G \;\longrightarrow\; Q(X \mid G) \;\longrightarrow\; X_1, \dots, X_n$$
Channel capacity for graph decoding is determined by the balance between the log number of models and the relative distinguishability of different models.

Necessary conditions for $\mathcal{G}_{d,p}$
Consider graphs $G \in \mathcal{G}_{d,p}$ with $p$ nodes and maximum degree $d$, and Ising models with minimum edge weight $|\theta_{st}| \ge \theta_{\min}$ for all edges and maximum neighborhood weight $\omega(\theta) := \max_{s \in V} \sum_{t \in N(s)} |\theta_{st}|$.
Theorem (Santhanam & Wainwright, 2008). If the sample size $n$ is upper bounded by
$$n < \max\Big\{ \frac{d}{8} \log\frac{p}{8d},\;\; \frac{\exp(\omega(\theta)/4)\,d\,\theta_{\min}}{128\,\exp(3\theta_{\min}/2)} \log\frac{pd}{8},\;\; \frac{\log p}{2\,\theta_{\min} \tanh(\theta_{\min})} \Big\},$$
then the probability of error of any algorithm over $\mathcal{G}_{d,p}$ is at least $1/2$.
Interpretation:
Naive bulk effect: arises from the log cardinality $\log |\mathcal{G}_{d,p}|$.
$d$-clique effect: difficulty of separating models that contain a near $d$-clique.
Small weight effect: difficulty of detecting edges with small weights.

Some consequences
Corollary. For asymptotically reliable recovery over $\mathcal{G}_{d,p}$, any algorithm requires at least $n = \Omega(d^2 \log p)$ samples.
Note that the maximum neighborhood weight satisfies $\omega(\theta^*) \ge d\,\theta_{\min}$, which forces $\theta_{\min} = O(1/d)$.
From the small weight effect,
$$n = \Omega\Big( \frac{\log p}{\theta_{\min} \tanh(\theta_{\min})} \Big) = \Omega\Big( \frac{\log p}{\theta_{\min}^2} \Big).$$
We conclude that $\ell_1$-regularized logistic regression (LR), which succeeds with $n \gtrsim d^3 \log p$ samples, is optimal up to a factor of $O(d)$ (Ravikumar, Wainwright & Lafferty, 2010).

Summary
1. Stochastic belief propagation: an adaptively randomized version of ordinary message-passing, with a finite-length convergence analysis.
2. Learning the structure of graphical models: a polynomial-time method for learning neighborhood structure; natural extensions (using block regularization) to higher-order models; matches information-theoretic limits.

Some papers:
Noorshams & Wainwright (2011). Stochastic belief propagation: Low-complexity message-passing with guarantees. Proceedings of the Allerton Conference.
Ravikumar, Wainwright & Lafferty (2010). High-dimensional Ising model selection using $\ell_1$-regularized logistic regression. Annals of Statistics.
Santhanam & Wainwright (2012). Information-theoretic limits of selecting binary graphical models in high dimensions. IEEE Transactions on Information Theory.
