
1 Bayesian Methods and Markov Random Fields
Department of Electrical and Computer Engineering, Instituto Superior Técnico, Lisboa, PORTUGAL
Thanks: Anil K. Jain and Robert D. Nowak, Michigan State University, USA

2 Most image analysis problems are "inference" problems: from an observed image g, infer an image f ("inference": g → f). For example, "edge detection": from the observed image, infer the edge map.

3 The word "image" should be understood in a wide sense. Examples: conventional image, CT image, flow image, range image.

4 Examples of "inference" problems: edge detection, image restoration, template matching, contour estimation.

5 Main features of "image analysis" problems:
- They are inference problems, i.e., they can be formulated as: "from g, infer f".
- They cannot be solved without using a priori knowledge.
- Both f and g are high-dimensional (e.g., images).
- They are naturally formulated as statistical inference problems.

6 Introduction to Bayesian theory. Basically, the Bayesian approach provides a way to "invert" an observation model, taking prior knowledge into account: the unknown f passes through the observation model to produce the observed g; prior knowledge about f, the observation model, and a loss function feed a Bayesian decision, producing the inferred f̂. Inferred = estimated, or detected, or classified, ...

7 The Bayesian philosophy: knowledge ↔ probability. A subjective (non-frequentist) interpretation of probability: probabilities express "degrees of belief". Example: "there is a 20% probability that a certain patient has a tumor". Since we are considering one particular patient, this statement has no frequential meaning; it expresses a degree of belief. It can be shown that probability theory is the right tool to formally deal with "degrees of belief" or "knowledge": Cox (46), Savage (54), Good (60), Jeffreys (39, 61), Jaynes (63, 68, 91).

8 Bayesian decision theory. Ingredients: knowledge about f (the prior p(f)), the observation model p(g|f), and a loss function L(f, f̂). Bayesian decision theory produces a decision rule f̂ = δ(g) mapping the observed data g to the inferred quantity f̂ -- an "algorithm".

9 How are Bayesian decision rules derived? By applying the fundamental principles of the Bayesian philosophy:
- Knowledge is expressed via probability functions.
- The "conditionality principle": any inference must be based on (conditioned on) the observed data g.
- The "likelihood principle": the information contained in the observation g can only be carried via the likelihood function p(g|f).
Accordingly, knowledge about f, once g is observed, is expressed by the a posteriori (or posterior) probability function:
  p(f|g) = p(f) p(g|f) / p(g)    ("Bayes law")

10 How are Bayesian decision rules derived? (cont.) Once g is observed, knowledge about f is expressed by p(f|g). Given g, the expected value of the loss function L(f, f̂) is
  E[ L(f, f̂) | g ] = ∫ L(f, f̂) p(f|g) df,
...the so-called "a posteriori expected loss". An "optimal Bayes rule" is one minimizing it:
  f̂_Bayes = δ_Bayes(g) = arg min_{f̂} ∫ L(f, f̂) p(f|g) df.

11 The "Bayesian image processor": the prior p(f) and the likelihood p(g|f) are combined by Bayes law into p(f|g) = p(f) p(g|f) / p(g); together with the loss function L(f, f̂), the a posteriori expected loss E[L(f, f̂)|g] is minimized, yielding the decision rule f̂ = δ(g). ("Pure Bayesians stop here: report the posterior!")

12 More on Bayes law, p(f|g) = p(f) p(g|f) / p(g). The numerator is the joint probability of f and g: p(g|f) p(f) = p(g, f). The denominator is simply a normalizing constant,
  p(g) = ∫ p(g, f) df = ∫ p(g|f) p(f) df,
...it is a marginal probability function. Other names: unconditional, predictive, evidence. In discrete cases, rather than an integral we have a summation.

13 The "0/1" loss function. For a scalar continuous f ∈ F, e.g., F = R,
  L_ε(f, f̂) = 1, if |f - f̂| ≥ ε;  0, if |f - f̂| < ε.
(Figure: the "0/1" loss function of width ε, as a function of f around δ(g).)

14 The "0/1" loss function (cont.) Minimizing the "a posteriori" expected loss:
  δ_ε(g) = arg min_d ∫_F L_ε(f, d) p(f|g) df
         = arg min_d ∫_{f: |f-d| ≥ ε} p(f|g) df
         = arg min_d [ 1 - ∫_{f: |f-d| < ε} p(f|g) df ]
         = arg max_d ∫_{d-ε}^{d+ε} p(f|g) df.

15 Letting ε approach zero,
  f̂_MAP = δ_MAP(g) = lim_{ε→0} δ_ε(g) = lim_{ε→0} arg max_d ∫_{d-ε}^{d+ε} p(f|g) df = arg max_f p(f|g),
...called the "maximum a posteriori" (MAP) estimator. With ε decreasing, δ_ε(g) "looks for" the highest mode of p(f|g).

16 The "0/1" loss for a scalar discrete f ∈ F:
  L(f, f̂) = 1, if f ≠ f̂;  0, if f = f̂.
Again, minimizing the "a posteriori" expected loss:
  δ(g) = arg min_d Σ_{f ∈ F} L(f, d) p(f|g) = arg min_d Σ_{f ≠ d} p(f|g) = arg min_d [ 1 - p(d|g) ] = arg max_f p(f|g) = f̂_MAP,
...the "maximum a posteriori" (MAP) classifier/detector.

17 "Quadratic" loss function. For a scalar continuous f ∈ F, e.g., F = R, L(f, f̂) = (f - f̂)². Minimizing the a posteriori expected loss,
  δ_PM(g) = arg min_d E[ (f - d)² | g ] = arg min_d { E[f²|g] + d² - 2 d E[f|g] }
(the first term is constant), which gives f̂_PM = E[f|g],
...the "posterior mean" (PM) estimator.

18 Example: Gaussian observations with a Gaussian prior. The observation model is
  p(g|f) = p([g_1 g_2 ... g_n]^T | f) = N([f f ... f]^T, σ² I) = (2πσ²)^(-n/2) exp{ -(1/(2σ²)) Σ_{i=1}^n (g_i - f)² },
where I denotes an identity matrix. The prior is
  p(f) = N(0, φ²) = (2πφ²)^(-1/2) exp{ -f²/(2φ²) }.
From these two models, the posterior is simply
  p(f|g) = N( n φ² ḡ / (σ² + n φ²), (n/σ² + 1/φ²)^(-1) ),  with ḡ = (g_1 + ... + g_n)/n.

19 Example: Gaussian observations with a Gaussian prior (cont.). As seen in the previous slide, p(f|g) = N( n φ² ḡ / (σ² + n φ²), (n/σ² + 1/φ²)^(-1) ). Then, since the mean and the mode of a Gaussian coincide,
  f̂_MAP = f̂_PM = n φ² ḡ / (σ² + n φ²),
i.e., the estimate is a "shrunken" version of the sample mean ḡ. If the prior had mean μ, we would have
  f̂_MAP = f̂_PM = (σ² μ + n φ² ḡ) / (σ² + n φ²),
i.e., the estimate is a weighted average of μ and ḡ.

20 Example: Gaussian observations with a Gaussian prior (cont.). Observe that
  lim_{n→∞} n φ² ḡ / (σ² + n φ²) = ḡ,
i.e., as n increases, the data dominates the estimate. The posterior variance does not depend on g,
  E[ (f - f̂)² | g ] = (n/σ² + 1/φ²)^(-1),
and is inversely proportional to the degree of confidence in the estimate. Notice also that
  lim_{n→∞} (n/σ² + 1/φ²)^(-1) = 0,
...as n → ∞ the confidence in the estimate becomes absolute.
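A minimal numerical sketch of this shrinkage behaviour, assuming NumPy; the true value, variances, and sample size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

f_true = 1.5          # unknown scalar to be estimated
sigma2 = 1.0          # observation noise variance
phi2 = 0.1            # prior variance (prior mean is zero)

n = 20
g = f_true + rng.normal(0.0, np.sqrt(sigma2), size=n)
g_bar = g.mean()

# Posterior is Gaussian; MAP and PM coincide (slides 18-20).
post_var = 1.0 / (n / sigma2 + 1.0 / phi2)
f_hat = (n * phi2 * g_bar) / (sigma2 + n * phi2)   # shrunken sample mean

print(f"sample mean       : {g_bar:.3f}")
print(f"MAP = PM estimate : {f_hat:.3f}")
print(f"posterior variance: {post_var:.4f}")
```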

21 (Figure: top panel, "Estimates" -- the sample mean, the MAP estimate with φ² = 0.1, and the MAP estimate with φ² = 0.01, versus the number of observations; bottom panel, "Variances" -- the a posteriori variance with φ² = 0.1 and with φ² = 0.01, versus the number of observations.)

22 Example: Gaussian mixture observations with a Gaussian prior. "Mixture" observation model:
  p(g|s) = λ (2πσ²)^(-1/2) exp{ -(g - s - μ)²/(2σ²) } + (1 - λ) (2πσ²)^(-1/2) exp{ -(g - s)²/(2σ²) },
i.e., g = s + θ + n, with n ~ N(0, σ²) and θ = μ with probability λ, θ = 0 with probability (1 - λ). Gaussian prior: p(s) = N(0, φ²).

23 Example: Gaussian mixture observations with a Gaussian prior (cont.). The posterior:
  p(s|g) ∝ λ exp{ -(g - s - μ)²/(2σ²) - s²/(2φ²) } + (1 - λ) exp{ -(g - s)²/(2σ²) - s²/(2φ²) }.
(Figure: p_S(s|g = 0.5) for λ = 0.6, μ = 4, φ² = 0.5; the posterior is bimodal.) The PM is a "compromise" between the two modes; the MAP picks the largest mode.
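A small sketch that evaluates this bimodal posterior on a grid and contrasts the MAP (largest mode) with the PM (a compromise between the modes); the parameter values mirror the example above but are otherwise illustrative:

```python
import numpy as np

lam, mu, sigma2, phi2 = 0.6, 4.0, 1.0, 0.5   # illustrative values
g = 0.5                                      # a single observation

s = np.linspace(-5.0, 10.0, 20001)           # grid over the unknown s
ds = s[1] - s[0]

# Unnormalized posterior p(s|g), as on the slide.
post = (lam * np.exp(-(g - s - mu) ** 2 / (2 * sigma2) - s ** 2 / (2 * phi2))
        + (1 - lam) * np.exp(-(g - s) ** 2 / (2 * sigma2) - s ** 2 / (2 * phi2)))
post /= post.sum() * ds                      # normalize numerically

s_map = s[np.argmax(post)]                   # MAP: largest mode
s_pm = (s * post).sum() * ds                 # PM: posterior mean

print(f"MAP estimate: {s_map:.3f}")
print(f"PM  estimate: {s_pm:.3f}")
```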

24 Improper priors and "maximum likelihood" inference. Recall that the posterior is computed according to p(f|g) = p(f) p(g|f) / p(g). If the MAP criterion is being used, and p(f) = k (a flat prior),
  f̂_MAP = arg max_f [ k p(g|f) / ∫ k p(g|f) df ] = arg max_f p(g|f),
...the "maximum likelihood" (ML) estimate. In the discrete case, simply replace the integral by a summation.

25 Improper priors and maximum likelihood inference (cont.). If the space to which f belongs is unbounded, e.g., f ∈ R^m or f ∈ N, the flat prior is "improper":
  ∫ p(f) df = ∫ k df = ∞,   or   Σ_f p(f) = Σ_f k = ∞.
If the posterior is proper, all the estimates are still well defined. Improper priors reinforce the "knowledge" interpretation of probabilities.

26 Compound inference: inferring a set of unknowns. Now f is a (say, m-dimensional) vector, f = [f_1 f_2 ... f_m]^T. Loss functions for compound problems:
- Additive: such that L(f, f̂) = Σ_{i=1}^m L_i(f_i, f̂_i).
- Non-additive: this decomposition does not exist.
Optimal Bayes rules are still
  f̂_Bayes = δ_Bayes(g) = arg min_{f̂} ∫ L(f, f̂) p(f|g) df.

27 Compound inference with non-additive loss functions. There is nothing fundamentally new in this case. The "0/1" loss, for a vector f (e.g., F = R^m):
  L_ε(f, f̂) = 1, if ||f - f̂|| ≥ ε;  0, if ||f - f̂|| < ε.
Following the same derivation yields
  f̂_MAP = δ_MAP(g) = arg max_f p(f|g),
i.e., the MAP estimate is the joint mode of the a posteriori probability function. Exactly the same expression is obtained for discrete problems.

28 Compound inference with non-additive loss functions (cont.). The quadratic loss, for f ∈ R^m: L(f, f̂) = (f - f̂)^T Q (f - f̂), where Q is a symmetric positive-definite (m × m) matrix. Minimizing the a posteriori expected loss,
  δ_PM(g) = arg min_{f̂} E[ (f - f̂)^T Q (f - f̂) | g ] = arg min_{f̂} { E[f^T Q f | g] + f̂^T Q f̂ - 2 f̂^T Q E[f|g] }
(the first term is constant) = solution of Q f̂ = Q E[f|g] (Q has an inverse), so
  f̂_PM = E[f|g],
...still the "posterior mean" (PM) estimator. Remarkably, this is true for any symmetric positive-definite Q. Special case: if Q is diagonal, the loss function is additive.

29 Compound inference with additive loss functions. Recall that, in this case, L(f, f̂) = Σ_{i=1}^m L_i(f_i, f̂_i). The optimal Bayes rule:
  δ_Bayes(g) = arg min_{f̂} ∫ Σ_{i=1}^m L_i(f_i, f̂_i) p(f|g) df
             = arg min_{f̂} Σ_{i=1}^m ∫ L_i(f_i, f̂_i) p(f|g) df
             = arg min_{f̂} Σ_{i=1}^m ∫ L_i(f_i, f̂_i) [ ∫ p(f|g) df_{-i} ] df_i,
where df_{-i} denotes df_1 ... df_{i-1} df_{i+1} ... df_m, that is, integration with respect to all variables except f_i.

30 Compound inference with additive loss functions (cont.). But ∫ p(f|g) df_{-i} = p(f_i|g), the a posteriori marginal of variable f_i. Then,
  δ_Bayes(g) = arg min_{f̂} Σ_{i=1}^m ∫ L_i(f_i, f̂_i) p(f_i|g) df_i,
that is,
  f̂_i,Bayes = arg min_{f̂_i} ∫ L_i(f_i, f̂_i) p(f_i|g) df_i,   i = 1, 2, ..., m.
Conclusion: each estimate is the minimizer of the corresponding marginal a posteriori expected loss.

31 Additive loss functions: special cases.
- The additive "0/1" loss function: L(f, f̂) = Σ_{i=1}^m L_i(f_i, f̂_i), where each L_i(f_i, f̂_i) is a "0/1" loss function for scalar arguments. According to the general result,
  f̂_MPM = [ arg max_{f_1} p(f_1|g), arg max_{f_2} p(f_2|g), ..., arg max_{f_m} p(f_m|g) ]^T,
the maximizer of posterior marginals (MPM).
- The additive quadratic loss function: L(f, f̂) = Σ_{i=1}^m (f_i - f̂_i)² = (f - f̂)^T (f - f̂). The general result for quadratic loss functions is still valid. This is a natural fact, because the mean is intrinsically marginal.

32 Example: Gaussian observations and Gaussian prior. Observation model: linear operator (matrix) plus additive white Gaussian noise,
  g = H f + n,  where n ~ N(0, σ² I).
Corresponding likelihood function:
  p(g|f) = (2πσ²)^(-n/2) exp{ -||H f - g||² / (2σ²) }.
Gaussian prior:
  p(f) = (2π)^(-n/2) det(K)^(-1/2) exp{ -(1/2) f^T K^(-1) f }.

33 Example: Gaussian observations and Gaussian prior (cont.). The a posteriori (joint) probability density function is still Gaussian, p(f|g) = N(f̂, P), with f̂ being both the MAP and the PM estimate, given by
  f̂ = arg min_f { f^T (σ² K^(-1) + H^T H) f - 2 f^T H^T g } = (σ² K^(-1) + H^T H)^(-1) H^T g.
This is also called the (vector) Wiener filter.
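A small dense-matrix sketch of this estimator on a 1-D signal (the blur matrix H and the prior precision K^(-1) used here are illustrative choices; real image problems need the iterative schemes discussed later):

```python
import numpy as np

rng = np.random.default_rng(1)

m = 50                                     # signal length (1-D for simplicity)
f_true = np.cumsum(rng.normal(size=m))     # a smooth-ish "image"

# Illustrative observation operator: 3-tap moving-average blur.
H = np.zeros((m, m))
for i in range(m):
    for k in (-1, 0, 1):
        if 0 <= i + k < m:
            H[i, i + k] = 1.0 / 3.0

sigma2 = 0.05
g = H @ f_true + rng.normal(0.0, np.sqrt(sigma2), size=m)

# Illustrative prior precision K^{-1}: penalizes first-order differences.
D = np.eye(m) - np.eye(m, k=1)             # finite-difference operator
K_inv = D.T @ D + 1e-3 * np.eye(m)         # small ridge keeps it invertible

# MAP = PM estimate (the vector Wiener filter of the slide).
f_hat = np.linalg.solve(sigma2 * K_inv + H.T @ H, H.T @ g)

print("RMSE of g    :", np.sqrt(np.mean((g - f_true) ** 2)))
print("RMSE of f_hat:", np.sqrt(np.mean((f_hat - f_true) ** 2)))
```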

34 Example: Gaussian observations and Gaussian prior -- special cases. No noise: in the absence of noise, σ² = 0,
  f̂ = arg min_f ||H f - g||² = (H^T H)^(-1) H^T g.
(H^T H)^(-1) H^T ≡ H† is called the Moore-Penrose pseudo (or generalized) inverse of matrix H. If H^(-1) exists, H† = H^(-1); if H is not invertible, H† provides the least-squares-sense pseudo-solution. This estimate is also the maximum likelihood one.

35 Example: Gaussian observations and Gaussian prior -- special cases. Prior covariance known up to a factor: K = φ² B, with the diagonal elements of B equal to 1; φ² can be seen as the "prior variance". We can then write
  f̂ = ( (σ²/φ²) B^(-1) + H^T H )^(-1) H^T g,
where σ²/φ² controls the relative weight of the prior and the data. Since K^(-1) = B^(-1)/φ² is positive definite, there exists a unique symmetric D such that D D = D^T D = B^(-1). This allows writing
  f̂ = arg min_f { ||g - H f||² + (σ²/φ²) ||D f||² }.
In regularization theory parlance, ||D f||² is called the regularizing term, and σ²/φ² the regularization parameter.

36 Summary of what we have seen up to this point.
- Image analysis problems are inference problems.
- Introduction to Bayesian inference:
  - Fundamental principles: knowledge as probability, likelihood and conditionality.
  - Fundamental tool: Bayes rule.
  - Necessary models: observation model, prior, loss function.
  - A posteriori expected loss and optimal Bayes rules.
  - The "0/1" loss function and MAP inference.
  - The quadratic error loss function and posterior mean estimation.
  - Example: Gaussian observations and Gaussian prior.
  - Example: mixture of Gaussians observations and Gaussian prior.
  - Improper priors and maximum likelihood (ML) inference.
  - Compound inference: additive and non-additive loss functions.
  - Example: Gaussian observations with Gaussian prior.

37 Conjugate priors: looking for computational convenience. Sometimes the prior knowledge is vague enough to allow tractability concerns to come into play. In other words: choose priors compatible with the available knowledge, but leading to a tractable a posteriori probability function. Conjugate priors formalize this goal. Given a family of likelihood functions L = { p(g|f), f ∈ F } and a (parametrized) family of priors P = { p(f|θ), θ ∈ Θ }, P is a conjugate family for L if, for every p(g|f) ∈ L and p(f|θ) ∈ P,
  p(f|g) = p(g|f) p(f|θ) / p(g) ∈ P,
i.e., there exists θ' ∈ Θ such that p(f|g) = p(f|θ').

38 Conjugate priors: a simple example. The family of Gaussian likelihood functions of common variance:
  L = { p(g|f) = N(f, σ²), f ∈ R }.
The family of Gaussian priors of arbitrary mean and variance:
  P = { p(f|μ, φ²) = N(μ, φ²), (μ, φ²) ∈ R × R⁺ }.
The a posteriori probability density function is again Gaussian,
  p(f|g) = N( (σ² μ + φ² g)/(σ² + φ²), σ² φ²/(σ² + φ²) ) ∈ P.
Very important: computing the a posteriori probability function only involves "updating" the parameters of the prior.

39 Conjugate priors: another example. θ is the (unknown) "heads" probability of a given coin. Outcomes of a sequence of n tosses: x = (x_1, ..., x_n), x_i ∈ {0, 1}. Likelihood function (Bernoulli), with n_h(x) = x_1 + x_2 + ... + x_n:
  p(x|θ) = θ^(n_h(x)) (1 - θ)^(n - n_h(x)).
A priori belief: "θ should be close to 1/2". Conjugate prior: the Beta density
  p(θ|α, β) = Be(α, β) = [ Γ(α + β) / (Γ(α) Γ(β)) ] θ^(α-1) (1 - θ)^(β-1),
defined for θ ∈ [0, 1] and α, β > 0.

40 Conjugate priors: Bernoulli example (cont.). Main features of Be(α, β):
  E[θ|α, β] = α / (α + β)   (mean)
  E[ (θ - E[θ])² ] = α β / ( (α + β)² (α + β + 1) )   (variance)
  arg max_θ p(θ|α, β) = (α - 1) / (α + β - 2)   (mode, if α, β > 1)
To "pull" the estimate towards 1/2, choose α = β; the common value α = β controls "how strongly we pull".

41 Conjugate priors: Bernoulli example (cont.). (Figure: several Beta densities, with α = β = 1, α = β = 2, α = β = 10, ..., as functions of θ.) For α = β = 1, the behavior is qualitatively different: the mode at 1/2 disappears.

42 Conjugate priors: Bernoulli example (cont.). The a posteriori distribution is again Beta,
  p(θ|x, α, β) = Be(α + n_h(x), β + n - n_h(x)).
Bayesian estimates of θ:
  θ̂_PM = θ̂_PM(x) = (α + n_h(x)) / (α + β + n)
  θ̂_MAP = θ̂_MAP(x) = (α + n_h(x) - 1) / (α + β + n - 2).
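A minimal sketch of this conjugate update, assuming NumPy; the prior hyperparameters and the simulated data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

theta_true = 0.7
n = 50
x = rng.binomial(1, theta_true, size=n)    # coin tosses
n_h = int(x.sum())                         # number of heads

alpha, beta = 5.0, 5.0                     # Be(5, 5): "theta close to 1/2"

# Conjugate update: the posterior is again Beta.
alpha_post = alpha + n_h
beta_post = beta + n - n_h

theta_pm = alpha_post / (alpha_post + beta_post)
theta_map = (alpha_post - 1) / (alpha_post + beta_post - 2)
theta_ml = n_h / n

print(f"ML : {theta_ml:.3f}")
print(f"PM : {theta_pm:.3f}")
print(f"MAP: {theta_map:.3f}")
```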

43 Conjugate priors: Bernoulli example (cont.). (Figure: evolution of the a posteriori densities with the number of tosses n, for a Be(5, 5) prior (dotted line) and a Be(1, 1) prior (solid line).)

44 Conjugate priors: variance of Gaussian observations. n i.i.d. zero-mean Gaussian observations of unknown variance σ² = 1/θ. Likelihood function:
  f(x|θ) = Π_{i=1}^n sqrt(θ/(2π)) exp{ -θ x_i²/2 } = (θ/(2π))^(n/2) exp{ -(θ/2) Σ_{i=1}^n x_i² }.
Conjugate prior: the Gamma density,
  p(θ|α, β) = Ga(α, β) = [ β^α / Γ(α) ] θ^(α-1) exp{ -β θ },
for θ ∈ [0, ∞) (recall θ = 1/σ²) and α, β > 0.

45 Conjugate priors: variance of Gaussian observations (cont.). Main features of the Gamma density:
  E[θ|α, β] = α/β   (mean)
  E[ (θ - E[θ])² ] = α/β²   (variance)
  arg max_θ p(θ|α, β) = (α - 1)/β   (mode, if α ≥ 1)

46 Conjugate priors: variance of Gaussian observations (cont.). (Figure: several Gamma densities, with α = β = 1, α = β = 2, α = β = 10, and α = 62.5, β = ..., as functions of θ.)

47 Conjugate priors: variance of Gaussian observations (cont.). A posteriori density:
  p(θ|x_1, x_2, ..., x_n) = Ga( α + n/2, β + (1/2) Σ_{i=1}^n x_i² ).
The corresponding Bayesian estimates:
  θ̂_PM = (α + n/2) / (β + (1/2) Σ_{i=1}^n x_i²)
  θ̂_MAP = (α + n/2 - 1) / (β + (1/2) Σ_{i=1}^n x_i²).
Both estimates converge to the ML estimate:
  lim_{n→∞} θ̂_PM = lim_{n→∞} θ̂_MAP = θ̂_ML = n ( Σ_{i=1}^n x_i² )^(-1).
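The same kind of sketch for the precision of zero-mean Gaussian data, with illustrative hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(3)

sigma2_true = 2.0                       # true variance; theta = 1/sigma2
n = 200
x = rng.normal(0.0, np.sqrt(sigma2_true), size=n)

alpha, beta = 2.0, 2.0                  # Ga(2, 2) prior on the precision theta

# Conjugate update: posterior is Ga(alpha + n/2, beta + sum(x^2)/2).
alpha_post = alpha + n / 2.0
beta_post = beta + 0.5 * np.sum(x ** 2)

theta_pm = alpha_post / beta_post
theta_map = (alpha_post - 1.0) / beta_post
theta_ml = n / np.sum(x ** 2)

print(f"true precision: {1.0 / sigma2_true:.3f}")
print(f"ML : {theta_ml:.3f}")
print(f"PM : {theta_pm:.3f}")
print(f"MAP: {theta_map:.3f}")
```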

48 The von Mises theorem. As long as the prior is continuous and not zero at the location of the ML estimate, the MAP estimate converges to the ML estimate as the number of data points n increases.

49 Bayesian model selection. Scenario: there are K models available, i.e., m ∈ {m_1, ..., m_K}. Given model m:
- likelihood function: p(g|f(m), m);
- prior: p(f(m)|m).
Under different m's, f(m) may have different meanings and sizes. A priori model probabilities: { p(m), m = m_1, ..., m_K }. The a posteriori probability function is
  p(m, f(m)|g) = p(g|f(m), m) p(f(m)|m) p(m) / p(g).

50 Seen strictly as a model selection problem, the natural loss function is the "0/1" loss with respect to the model, i.e.,
  L[ (m, f(m)), (m̂, f̂(m̂)) ] = 0, if m̂ = m;  1, if m̂ ≠ m.
The resulting rule is the "most probable model a posteriori":
  m̂ = arg max_m ∫ p(m, f(m)|g) df(m) = arg max_m p(m|g)
    = arg max_m p(m) ∫ p(g|f(m), m) p(f(m)|m) df(m)
    = arg max_m { p(m) p(g|m) },
where p(g|m) is the evidence. Main difficulty: improper priors (for p(f(m)|m)) are not valid here, because they are only defined up to a factor.

51 Bayesian model selection. Comparing two models: which of m_1 or m_2 is a posteriori more likely? The answer is given by the so-called "posterior odds ratio":
  p(m_1|g) / p(m_2|g) = [ p(g|m_1) / p(g|m_2) ] × [ p(m_1) / p(m_2) ] = "Bayes factor" × "prior odds ratio".
Bayes factor = evidence, provided by g, for m_1 versus m_2.

52 Bayesian model selection: example. Does a sequence of binary variables (e.g., coin tosses) come from two different sources? Observations: g = [g_1, ..., g_t, g_{t+1}, ..., g_{2t}], with g_i ∈ {0, 1}. Competing models:
- m_1 = "all g_i's come from the same i.i.d. binary source with Prob(1) = θ" (e.g., the same coin);
- m_2 = "[g_1, ..., g_t] and [g_{t+1}, ..., g_{2t}] come from two different sources with Prob(1) = θ and Prob(1) = φ, respectively" (e.g., two coins with different probabilities of "heads").
Parameter vector under m_1: f(m_1) = [θ]. Parameter vector under m_2: f(m_2) = [θ, φ]. Notice that with θ = φ, m_2 becomes m_1.

53 Bayesian model selection: example (cont.). Likelihood function under m_1:
  p(g|θ, m_1) = Π_{i=1}^{2t} θ^(g_i) (1 - θ)^(1 - g_i) = θ^(n(g)) (1 - θ)^(2t - n(g)),
where n(g) is the total number of 1's. Likelihood function under m_2:
  p(g|θ, φ, m_2) = θ^(n_1(g)) (1 - θ)^(t - n_1(g)) φ^(n_2(g)) (1 - φ)^(t - n_2(g)),
where n_1(g) and n_2(g) are the numbers of ones in the first and second halves of the data, respectively. Notice that n_1(g) + n_2(g) = n(g).

54 Bayesian model selection: example (cont.). Prior under m_1: p(θ|m_1) = 1, for θ ∈ [0, 1]. Prior under m_2: p(θ, φ|m_2) = 1, for (θ, φ) ∈ [0, 1] × [0, 1]. These two priors mean: "in either case, we know nothing about the parameters".

55 Bayesian model selection: example (cont.). Evidence in favor of m_1 (recall that p(θ|m_1) = 1):
  p(g|m_1) = ∫_0^1 θ^(n(g)) (1 - θ)^(2t - n(g)) dθ = n(g)! (2t - n(g))! / (2t + 1)!.
Evidence in favor of m_2 (recall that p(θ, φ|m_2) = 1):
  p(g|m_2) = ∫_0^1 ∫_0^1 θ^(n_1(g)) (1 - θ)^(t - n_1(g)) φ^(n_2(g)) (1 - φ)^(t - n_2(g)) dθ dφ
           = [ n_1(g)! (t - n_1(g))! / (t + 1)! ] [ n_2(g)! (t - n_2(g))! / (t + 1)! ].
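A short sketch that computes the two evidences in log-space (via log-factorials) and the resulting log Bayes factor; the simulated data and source probabilities are illustrative:

```python
import numpy as np
from math import lgamma   # lgamma(n + 1) = log(n!)

def log_evidence_m1(n, two_t):
    # log of n! (2t - n)! / (2t + 1)!
    return lgamma(n + 1) + lgamma(two_t - n + 1) - lgamma(two_t + 2)

def log_evidence_m2(n1, n2, t):
    # log of the product of the two per-half terms
    return (lgamma(n1 + 1) + lgamma(t - n1 + 1) - lgamma(t + 2)
            + lgamma(n2 + 1) + lgamma(t - n2 + 1) - lgamma(t + 2))

rng = np.random.default_rng(4)
t = 50
g1 = rng.binomial(1, 0.3, size=t)          # first half: Prob(1) = 0.3
g2 = rng.binomial(1, 0.7, size=t)          # second half: Prob(1) = 0.7
n1, n2 = int(g1.sum()), int(g2.sum())

log_bf = log_evidence_m1(n1 + n2, 2 * t) - log_evidence_m2(n1, n2, t)
print(f"n1 = {n1}, n2 = {n2}, log Bayes factor (m1 vs m2) = {log_bf:.2f}")
# Negative values favour m2 (two different sources), as expected for this data.
```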

56 Bayesian model selection: example (cont.). (Figure: decision regions over all possible outcomes (n_1 versus n), with 2t = 100 and p(m_1) = p(m_2) = 1/2; a central band is decided as m_1 (same source), the two outer regions as m_2 (two sources).)

57 Bayesian model selection: another example. Segmenting a sequence of binary i.i.d. observations. (Figure: the sequence of trials; the log of the Bayes factor versus candidate location, for the first segmentation, for the segmentation of the left segment, and for the segmentation of the right segment.) Is there a change of model? Where?

58 Model selection: Schwarz's Bayesian inference criterion (BIC). Often, it is very difficult/impossible to compute p(g|m). By using a Taylor expansion of the likelihood around the ML estimate, and for a "smooth enough" prior, we have
  p(g|m) ≈ p(g|f̂(m), m) n^(-dim(f(m))/2) ≡ BIC(m),
where f̂(m) is the ML estimate under model m, dim(f(m)) = "dimension of f(m) under model m", and n is the size of the observation vector g. Let us also look at
  -log(BIC(m)) = -log p(g|f̂(m), m) + (dim(f(m))/2) log n.

59 Model selection: Rissanen's minimum description length (MDL). Consider an unknown f(k) of unknown dimension k. Data is observed according to p(g|f(k)). For each k (each model), p(f(k)|k) is constant, i.e., if k were known, we could find the ML estimate f̂(k). However, k is unknown, and the likelihood increases with k:
  k_2 > k_1  ⇒  p(g|f̂(k_2)) ≥ p(g|f̂(k_1)).
Conclusion: the ML estimate of k is "as large as possible"; this is clearly useless.

60 Minimum description length (MDL). Fact (from information theory): the shortest code-length for data g, given that it was generated according to p(g|f(k)), is
  L(g|f(k)) = -log_2 p(g|f(k))   (bits).
Then, for a given k, looking for the ML estimate of f(k) is the same as looking for the code for which g has the shortest code-word:
  arg max_{f(k)} p(g|f(k)) = arg min_{f(k)} { -log p(g|f(k)) } = arg min_{f(k)} L(g|f(k)).
If a code is built to transmit g, based on f(k), then f(k) also has to be transmitted. Conclusion: the total code-length is
  L(g, f(k)) = L(g|f(k)) + L(f(k)).

61 Minimum description length (MDL) (cont.). The total code-length is L(g, f(k)) = -log_2 p(g|f(k)) + L(f(k)). The MDL criterion:
  (k̂, f̂(k̂))_MDL = arg min_{k, f(k)} { -log_2 p(g|f(k)) + L(f(k)) }.
Basically, the term L(f(k)) grows with k, counterbalancing the behavior of the likelihood. From a Bayesian point of view, we have a prior
  p(f(k)) ∝ 2^(-L(f(k))).

62 Minimum description length (cont.). What about L(f(k))? It is problem-dependent. If the components of f(k) are real numbers (and under certain other conditions), the (asymptotically) optimal choice is
  L(f(k)) = (k/2) log n,
where n is the size of the data vector g. Interestingly, in this case MDL coincides with BIC. In other situations (e.g., discrete parameters), there are natural choices.

63 Minimum description length: example. Fitting a polynomial of unknown degree: f(k+1) contains the coefficients of a k-order polynomial. Observation model: g = "true polynomial plus white Gaussian noise". (Figure: fits of several orders -- 2, 3, 4, 6, 10, 15, ... -- to the same data.)

64 Minimum description length: example. Fitting a polynomial of unknown degree. (Figure: the negative log-likelihood -log p(g|f(k)) keeps going down as the polynomial order grows, but the description length is minimized at, and MDL picks, the right order k̂ = 4.)
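A compact sketch of this order-selection experiment: least-squares polynomial fits of increasing order, scored with the (k/2) log n penalty from the previous slides. The generating polynomial, noise level, and order range are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)

n = 100
t = np.linspace(-1.0, 1.0, n)
true_coeffs = [3.0, -1.0, 0.3, 2.0, -1.5]          # a 4th-order polynomial
sigma = 0.3
g = np.polyval(true_coeffs, t) + rng.normal(0.0, sigma, size=n)

best_order, best_dl = None, np.inf
for k in range(0, 16):
    coeffs = np.polyfit(t, g, deg=k)               # ML (least-squares) fit
    resid = g - np.polyval(coeffs, t)
    sigma2_hat = np.mean(resid ** 2)
    # Gaussian negative log-likelihood at the ML fit (constants dropped).
    neg_log_lik = 0.5 * n * np.log(sigma2_hat)
    # MDL / BIC penalty: (number of coefficients / 2) * log n.
    descr_len = neg_log_lik + 0.5 * (k + 1) * np.log(n)
    if descr_len < best_dl:
        best_order, best_dl = k, descr_len

print("selected order:", best_order)               # usually 4 for this setup
```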

65 Introduction to Markov random fields. Image analysis problems are compound inference problems. The prior p(f) formalizes the expected joint behavior of the elements of f. Markov random fields are a convenient tool to write priors for image analysis problems, just as Markov random processes formalize temporal evolutions/dependencies.

66 Graphs and random fields on graphs: basic graph-theoretic concepts. A graph G = (N, E) is a collection of nodes (or vertices) N = {n_1, n_2, ..., n_|N|} and edges E = {(n_{i_1}, n_{i_2}), ...} ⊆ N × N. Notation: |N| = number of elements of set N. We consider only undirected graphs, i.e., the elements of E are seen as unordered pairs: (n_i, n_j) ≡ (n_j, n_i). Two nodes n_1, n_2 ∈ N are neighbors if the corresponding edge exists, i.e., if (n_1, n_2) ∈ E.

67 Graphs and random fields on graphs: basic graph-theoretic concepts (cont.). A complete graph: all nodes are neighbors of all other nodes. A node is not a neighbor of itself: no (n_i, n_i) edges are allowed. Neighborhood of a node: N(n_i) = { n_j : (n_i, n_j) ∈ E }. The neighborhood relation is symmetrical: n_j ∈ N(n_i) ⇔ n_i ∈ N(n_j).

68 Graphs and random fields on graphs. Example of a graph:
  N = {1, 2, 3, 4, 5, 6},
  E = { (1,2), (1,3), (2,4), (2,5), (3,6), (5,6), (3,4), (4,5) } ⊆ N × N,
  N(1) = {2, 3}, N(2) = {1, 4, 5}, N(3) = {1, 4, 6}, etc.

69 Graphs and random fields on graphs. A clique of G is either a single node or a complete subgraph of G; in other words, a single node or a subset of nodes that are all mutual neighbors. (Figure: examples of cliques from the previous graph.) Set of all cliques (from the same example): C = N ∪ E ∪ { (2, 4, 5) }.

70 Graphs and random fields on graphs. A length-k path in G is an ordered sequence of nodes (n_1, n_2, ..., n_k) such that (n_j, n_{j+1}) ∈ E. (Figure: a graph and a path in it.)

71 Graphs and random fields on graphs. Let A, B, C be three disjoint subsets of N. We say that C separates A from B if any path from a node in A to a node in B contains one (or more) node in C. Example: in the graph above, C = {1, 4, 6} separates A = {3} from B = {2, 5}.
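These notions are easy to check programmatically. A small sketch for the example graph above, using a breadth-first search that is not allowed to enter C in order to test separation (pure standard library; the function name is just illustrative):

```python
from collections import deque

nodes = {1, 2, 3, 4, 5, 6}
edges = {(1, 2), (1, 3), (2, 4), (2, 5), (3, 6), (5, 6), (3, 4), (4, 5)}

# Undirected neighborhoods N(i).
nbrs = {i: set() for i in nodes}
for a, b in edges:
    nbrs[a].add(b)
    nbrs[b].add(a)

def separates(C, A, B):
    """True if every path from A to B passes through C (BFS that avoids C)."""
    frontier, seen = deque(A), set(A)
    while frontier:
        u = frontier.popleft()
        if u in B:
            return False               # reached B without crossing C
        for v in nbrs[u] - C - seen:
            seen.add(v)
            frontier.append(v)
    return True

print("N(2) =", nbrs[2])                               # {1, 4, 5}
print("C={1,4,6} separates {3} from {2,5}:",
      separates({1, 4, 6}, {3}, {2, 5}))               # True
```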

72 Graphs and random fields on graphs. Consider a joint probability function p(f) = p(f_1, f_2, ..., f_m). Assign each variable to a node of a graph, N = {1, 2, ..., m}; we then have a "random field on graph N". Let f_A, f_B, f_C be three disjoint subsets of f (i.e., A, B, and C are disjoint subsets of N). If
  p(f_A, f_B | f_C) = p(f_A|f_C) p(f_B|f_C)  ⇐  "C separates A from B",
then p(f) is "global Markov" with respect to N, and the graph is called an "I-map" of p(f). Any p(f) is "global Markov" with respect to the complete graph. If rather than ⇐ we have ⇔, the graph is called a "perfect I-map".

73 Graphs and random fields on graphs: pair-wise Markovianity. (i, j) ∉ E ⇒ "f_i and f_j are independent, when conditioned on all the other variables". Proof: simply notice that if i and j are not neighbors, the remaining nodes separate i from j. Example: in the accompanying six-node graph,
  p(f_1, f_6 | f_2, f_3, f_4, f_5) = p(f_1 | f_2, f_3, f_4, f_5) p(f_6 | f_2, f_3, f_4, f_5).

74 Local Markovianity: "given its neighborhood, a variable is independent of the rest",
  p(f_i, f_{N\({i} ∪ N(i))} | f_{N(i)}) = p(f_i | f_{N(i)}) p(f_{N\({i} ∪ N(i))} | f_{N(i)}).
Proof: notice that N(i) separates f_i from the rest of the graph. Equivalent form (better known in the MRF literature):
  p(f_i | f_{N\{i}}) = p(f_i | f_{N(i)}).
Proof: divide the equality above by p(f_{N\({i} ∪ N(i))} | f_{N(i)}):
  p(f_i | f_{N\({i} ∪ N(i))}, f_{N(i)}) = p(f_i | f_{N(i)})  ⇔  p(f_i | f_{N\{i}}) = p(f_i | f_{N(i)}),
because [N\({i} ∪ N(i))] ∪ N(i) = N\{i}.

75 The Hammersley-Clifford theorem. Consider a random field F on a graph N, such that p(f) > 0.
a) If the field F has the local Markov property, then p(f) can be written as a Gibbs distribution,
  p(f) = (1/Z) exp{ -Σ_{C ∈ C} V_C(f_C) },
where Z, the normalizing constant, is called the partition function. The functions V_C(·) are called clique potentials. The negative of the exponent is called the energy.
b) If p(f) can be written in Gibbs form for the cliques of some graph, then it has the global Markov property.
Fundamental consequence: a Markov random field can be specified via the clique potentials.

76 Hammersley-Clifford theorem (cont.). Computing the local Markovian conditionals from the clique potentials:
  p(f_i | f_{N(i)}) = (1/Z(f_{N(i)})) exp{ -Σ_{C: i ∈ C} V_C(f_C) }.
Notice that the normalizing constant may depend on the neighborhood state.

77 Regular rectangular lattices. Let us now focus on regular rectangular lattices:
  N = { (i, j), i = 1, ..., M, j = 1, ..., N }.
A hierarchy of neighborhood systems:
  N⁰(i, j) = { } -- zero-order (empty neighborhoods);
  N¹(i, j) = { (k, l) : (i - k)² + (j - l)² ≤ 1 } -- order-1 (4 nearest neighbors);
  N²(i, j) = { (k, l) : (i - k)² + (j - l)² ≤ 2 } -- order-2 (8 nearest neighbors);
  etc.

78 Regular rectangular lattices. Illustration of the first order neighborhood system:
  N¹(i, j) = { (i-1, j), (i, j-1), (i, j+1), (i+1, j) }   (4 nearest neighbors).

79 Regular rectangular lattices. Illustration of the second order neighborhood system:
  N²(i, j) = { (i-1, j-1), (i-1, j), (i-1, j+1), (i, j-1), (i, j+1), (i+1, j-1), (i+1, j), (i+1, j+1) }   (8 nearest neighbors).

80 Regular rectangular lattices. Cliques of a first order neighborhood system: all single nodes, plus all vertical pairs {(i-1, j), (i, j)} and horizontal pairs {(i, j-1), (i, j)}. Notation: C_k = "set of all cliques for the order-k neighborhood system".

81 Regular rectangular lattices. Cliques of a second order neighborhood: C_1 plus all subgraphs of the following types: diagonal pairs, the L-shaped triples, and the 2×2 quadruple of mutually adjacent sites.

82 Auto-models. Only pair-wise interactions; in terms of clique potentials, |C| > 2 ⇒ V_C(·) = 0. These are the simplest models beyond site independence. Even for large neighborhoods, we can define an auto-model.

83 Gauss-Markov random fields (GMRF). Joint probability density function (for zero mean):
  p(f) = [ det(A)^(1/2) / (2π)^(m/2) ] exp{ -(1/2) f^T A f }.
Matrix A (the potential matrix, inverse of the covariance matrix) determines the neighborhood system:
  i ∈ N(j)  ⇔  A_ij ≠ 0   (i ≠ j).
The quadratic form in the exponent can be written as
  f^T A f = Σ_{i=1}^m Σ_{j=1}^m f_i f_j A_ij,
revealing that this is an auto-model (there are only pair-wise terms).

84 Notice that, to be a valid potential matrix, A has to be symmetric, thus respecting the symmetry of the neighborhood relations.

85 Gauss-Markov random fields (GMRF). The local (Markov-type) conditionals are univariate Gaussian:
  p(f_i | {f_j, j ≠ i}) = sqrt(A_ii/(2π)) exp{ -(A_ii/2) [ f_i + (1/A_ii) Σ_{j≠i} A_ij f_j ]² }
                        = N( -(1/A_ii) Σ_{j≠i} A_ij f_j, 1/A_ii ).

86 Gauss-Markov random fields (GMRF). Specification via clique potentials: squares of weighted sums,
  V_C(f_C) = (α/2) ( Σ_{j ∈ C} θ_j^C f_j )² = (α/2) ( Σ_{j ∈ N} θ_j^C f_j )²,
as long as we define θ_j^C = 0 for j ∉ C. The exponent of the GMRF density becomes
  -Σ_{C ∈ C} V_C(f_C) = -(α/2) Σ_{j ∈ N} Σ_{k ∈ N} f_j f_k ( Σ_{C ∈ C} θ_j^C θ_k^C ) = -(1/2) f^T (α A) f,   with A_jk = Σ_{C ∈ C} θ_j^C θ_k^C,
showing that this is a GMRF with potential matrix α A.

87 Gauss-Markov random fields (GMRF): the classical "smoothing prior" GMRF. A lattice N = { (i, j), i = 1, ..., M, j = 1, ..., N }, with a first order neighborhood
  N((i, j)) = { (i-1, j), (i, j-1), (i+1, j), (i, j+1) }.
Clique set: all pairs of (vertically or horizontally) adjacent sites. Clique potentials: squares of first-order differences,
  V_{(i,j),(i,j-1)}(f_ij, f_i,j-1) = (α/2) (f_ij - f_i,j-1)²
  V_{(i,j),(i-1,j)}(f_ij, f_i-1,j) = (α/2) (f_ij - f_i-1,j)².
Resulting A matrix: block-tridiagonal with tridiagonal blocks. Matrix A is also quasi-block-Toeplitz with quasi-Toeplitz blocks; "quasi-" due to boundary corrections.
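A sketch that assembles this potential matrix for a small M × N lattice directly from the pair cliques; the (α/2) factor is kept outside, and the column-by-column stacking of the image into a vector is an assumption of this sketch:

```python
import numpy as np

def smoothing_potential_matrix(M, N):
    """A such that f^T A f = sum over adjacent pairs of (f_p - f_q)^2."""
    m = M * N
    A = np.zeros((m, m))

    def idx(i, j):                 # stack the image column by column
        return j * M + i

    for i in range(M):
        for j in range(N):
            p = idx(i, j)
            for di, dj in ((1, 0), (0, 1)):        # down and right neighbours
                if i + di < M and j + dj < N:
                    q = idx(i + di, j + dj)
                    # (f_p - f_q)^2 contributes +1, +1 on the diagonal, -1 off it.
                    A[p, p] += 1.0
                    A[q, q] += 1.0
                    A[p, q] -= 1.0
                    A[q, p] -= 1.0
    return A

A = smoothing_potential_matrix(4, 3)
print(A.shape)                     # (12, 12): block-tridiagonal, tridiagonal blocks
print(np.allclose(A, A.T))         # symmetric, as required
```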

88 Bayesian image restoration with GMRF prior. A "smoothing" GMRF prior:
  p(f) ∝ exp{ -(α/2) f^T A f },
where A is as defined on the previous slide. Observation model: linear operator (matrix) plus additive white Gaussian noise,
  g = H f + n,  where n ~ N(0, σ² I).
This models well: out-of-focus blur, motion blur, tomographic imaging, ... There is nothing new here: we saw before that the MAP and PM estimates are simply
  f̂ = (σ² α A + H^T H)^(-1) H^T g.
...the only difficulty: the matrix to be inverted is huge.

89 Bayesian image restoration with GMRF prior (cont.). With a "smoothing" GMRF prior and a linear observation model plus Gaussian noise, the optimal estimate is
  f̂ = (σ² α A + H^T H)^(-1) H^T g.
A similar result can be obtained in other theoretical frameworks: regularization, penalized likelihood. Notice that
  lim_{σ²→0} (σ² α A + H^T H)^(-1) H^T = (H^T H)^(-1) H^T = H†,
the (least squares) pseudo-inverse of H. The huge size of (σ² α A + H^T H) precludes any explicit inversion; iterative schemes are (almost always) used.
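A sketch of such an iterative scheme on a small 1-D problem: a hand-written conjugate-gradient loop that only needs matrix-vector products with (σ²αA + HᵀH), never its inverse. The blur, signal, and parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)

# Small 1-D restoration problem: g = H f + n, smoothing prior energy (alpha/2) f^T A f.
m = 80
f_true = np.repeat(rng.normal(size=8), 10)        # piecewise-constant signal

H = np.zeros((m, m))                              # illustrative 5-tap blur
for i in range(m):
    for k in range(-2, 3):
        if 0 <= i + k < m:
            H[i, i + k] = 0.2

D = np.eye(m) - np.eye(m, k=1)                    # first-order differences
A = D.T @ D                                       # 1-D "smoothing" potential matrix

sigma2, alpha = 0.01, 1.0
g = H @ f_true + rng.normal(0.0, np.sqrt(sigma2), size=m)

def matvec(x):                                    # (sigma^2 alpha A + H^T H) x
    return sigma2 * alpha * (A @ x) + H.T @ (H @ x)

# Plain conjugate gradients: solve matvec(f) = H^T g without forming an inverse.
b = H.T @ g
f = np.zeros(m)
r = b - matvec(f)
p = r.copy()
rs = r @ r
for _ in range(200):
    Ap = matvec(p)
    a = rs / (p @ Ap)
    f += a * p
    r -= a * Ap
    rs_new = r @ r
    if np.sqrt(rs_new) < 1e-8:
        break
    p = r + (rs_new / rs) * p
    rs = rs_new

print("RMSE of restored signal:", np.sqrt(np.mean((f - f_true) ** 2)))
```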

90 Bayesian image restoration with GMRF prior (cont.). Examples (figure): (a) original; (b) blurred and slightly noisy; (c) restored from (b); (d) no blur, severe noise; (e) restored from (d). Deblurring: a good job. Denoising: oversmoothing, i.e., "discontinuities" are smoothed out.

91 Solutions to the oversmoothing nature of the GMRF prior.
- Explicitly detect and preserve discontinuities: compound GMRF models, weak membrane, etc. A new set of variables comes into play: the edge (or line) field.
- Replace the "square law" potentials by other, more "robust" functions. Consequence: the quadratic nature of the a posteriori energy is lost; the optimization becomes much more difficult.

92 Compound Gauss-Markov random fields. Insert binary variables which can "turn off" clique potentials: horizontal and vertical line variables h_ij ∈ {0, 1} and v_ij ∈ {0, 1}, placed between adjacent pixel pairs. New clique potentials:
  V(f_ij, f_i,j-1, v_ij) = (α/2) (1 - v_ij) (f_ij - f_i,j-1)²
  V(f_ij, f_i-1,j, h_ij) = (α/2) (1 - h_ij) (f_ij - f_i-1,j)².

93 Compound Gauss-Markov random fields (cont.). The line variables can leave the quadratic potentials on,
  V(f_ij, f_i,j-1, 0) = (α/2) (f_ij - f_i,j-1)²,   V(f_ij, f_i-1,j, 0) = (α/2) (f_ij - f_i-1,j)²,
or "turn them off",
  V(f_ij, f_i,j-1, 1) = 0,   V(f_ij, f_i-1,j, 1) = 0,
meaning "there is an edge here, do not smooth!".

94 Compound Gauss-Markov random fields (cont.). Given a certain configuration of the line variables, we still have a Gauss-Markov random field,
  p(f|h, v) ∝ exp{ -(α/2) f^T A(h, v) f },
but the potential matrix now depends on h and v. Given h and v, the MAP (and PM) estimate of f has the same form:
  f̂(h, v) = (σ² α A(h, v) + H^T H)^(-1) H^T g.
Question: how to estimate h and v? Hint: h and v are "parameters" of the prior. This motivates a detour on "how to estimate parameters?".

95 Parameter estimation in Bayesian inference problems. The prior depends on parameter(s) θ, i.e., we write p(f|θ). The likelihood (observation model) depends on parameter(s) φ, i.e., we write p(g|f, φ). With explicit reference to these parameters, Bayes rule becomes
  p(f|g, φ, θ) = p(g|f, φ) p(f|θ) / p(g|φ, θ) = p(g|f, φ) p(f|θ) / ∫ p(g|f, φ) p(f|θ) df.
Question: how can we estimate φ and θ from g, without violating the fundamental "likelihood principle"?

96 Parameter estimation in Bayesian inference problems. How to estimate θ and φ from g, without violating the "likelihood principle"? Answer: the scenario has to be modified.
- Rather than just f, there is a new set of unknowns: (f, θ, φ).
- There is a new likelihood function: p(g|f, θ, φ) = p(g|f, φ).
- A new prior is needed: p(f, θ, φ) = p(f|θ) p(θ, φ), because f is independent of φ (given θ).
Usually, p(θ, φ) is called a hyper-prior. This is called a hierarchical Bayesian setting; here, with two levels. Usually, θ and φ are a priori independent, p(θ, φ) = p(θ) p(φ). To add one more level, consider parameters λ of the hyper-prior, p(θ, λ) = p(θ|λ) p(λ). And so on...

97 Parameter estimation in Bayesian inference problems. We may compute a complete a posteriori probability function:
  p(f, θ, φ|g) = p(g|f, θ, φ) p(f, θ, φ) / p(g) = p(g|f, φ) p(f|θ) p(θ, φ) / ∫∫∫ p(g|f, φ) p(f|θ) p(θ, φ) df dθ dφ.
How to use it depends on the adopted loss function. Notice that, even if f, θ, and φ are scalar, this is now a compound inference problem.

98 Parameter estimation in Bayesian inference problems. Non-additive "0/1" loss function L[(f, θ, φ), (f̂, θ̂, φ̂)]. As seen above, this leads to the joint MAP (JMAP) criterion:
  (f̂, θ̂, φ̂)_JMAP = arg max_{(f, θ, φ)} p(f, θ, φ|g).
With a uniform prior on the parameters, p(θ, φ) = k,
  (f̂, θ̂, φ̂)_JMAP = arg max_{(f, θ, φ)} p(g|f, φ) p(f|θ) = arg max_{(f, θ, φ)} p(g, f|θ, φ) ≡ (f̂, θ̂, φ̂)_GML,
sometimes called the generalized maximum likelihood (GML) criterion.

99 Parameter estimation in Bayesian inference problems. A "0/1" loss function, additive with respect to f and the parameters, i.e.,
  L[(f, θ, φ), (f̂, θ̂, φ̂)] = L^(1)[f, f̂] + L^(2)[(θ, φ), (θ̂, φ̂)],
where L^(1)[·,·] is a non-additive "0/1" loss function and L^(2)[·,·] is an arbitrary loss function. From the results above on additive loss functions, the estimate of f is
  f̂_MMAP = arg max_f p(f|g) = arg max_f ∫∫ p(f, θ, φ|g) dθ dφ,
the so-called marginalized MAP (MMAP). The parameters are "integrated out" from the a posteriori density.

100 Parameter estimation in Bayesian inference problems. As in the previous case, let
  L[(f, θ, φ), (f̂, θ̂, φ̂)] = L^(1)[f, f̂] + L^(2)[(θ, φ), (θ̂, φ̂)],
now with L^(2)[·,·] a non-additive "0/1" loss function. Considering a uniform prior p(θ, φ) = k,
  (θ̂, φ̂)_MMAP = arg max_{(θ, φ)} ∫ p(f, θ, φ|g) df = arg max_{(θ, φ)} ∫ p(g|f, φ) p(f|θ) df = arg max_{(θ, φ)} ∫ p(g, f|θ, φ) df = arg max_{(θ, φ)} p(g|θ, φ),
the so-called marginal maximum likelihood (MML) estimate. The unknown image is "integrated out" from the likelihood function.

101 Parameter estimation in Bayesian inference problems. Implementing the JMAP criterion, (f̂, θ̂, φ̂)_JMAP = arg max p(f, θ, φ|g), is usually very difficult. A sub-optimal criterion, called partial optimal solution (POS): (f̂, θ̂, φ̂)_POS is a solution of the system
  f̂_POS = arg max_f p(f, θ̂_POS, φ̂_POS | g)
  θ̂_POS = arg max_θ p(f̂_POS, θ, φ̂_POS | g)
  φ̂_POS = arg max_φ p(f̂_POS, θ̂_POS, φ | g).
POS is weaker than JMAP, i.e., JMAP ⇒ POS, but POS ⇏ JMAP. How to find a POS? Simply cycle through its defining equations until a stationary point is reached.

102 Parameter estimation in Bayesian inference problems. Implementing the marginal ML criterion: the EM algorithm. Recall that the MML criterion is
  (θ̂, φ̂)_MML = arg max_{(θ, φ)} p(g|θ, φ) = arg max_{(θ, φ)} ∫ p(g, f|θ, φ) df.
Usually, it is infeasible to obtain the marginal likelihood analytically. Alternative: use the expectation-maximization (EM) algorithm.

103 Parameter estimation in Bayesian inference problems. The EM algorithm:
E-step: compute the so-called Q-function, the expected value of the logarithm of the complete likelihood function, given the current parameter estimates:
  Q(θ, φ | θ̂^(n), φ̂^(n)) = ∫ p(f|g, θ̂^(n), φ̂^(n)) log p(g, f|θ, φ) df.
M-step: update the parameter estimates according to
  (θ̂, φ̂)^(n+1) = arg max_{(θ, φ)} Q(θ, φ | θ̂^(n), φ̂^(n)).
Under certain (mild) conditions, lim_{n→∞} (θ̂, φ̂)^(n) = (θ̂, φ̂)_MML.
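The EM recipe above is generic; as a deliberately simple concrete sketch (not the image-restoration EM itself), here is EM for a two-component Gaussian mixture with unknown weight and offset and known variance, which mirrors the mixture observation model of slides 22-23. All numerical settings are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

# Data: each sample is N(mu, s2) with prob. lam, or N(0, s2) with prob. 1 - lam.
lam_true, mu_true, s2 = 0.6, 4.0, 1.0
n = 1000
z = rng.random(n) < lam_true
x = rng.normal(np.where(z, mu_true, 0.0), np.sqrt(s2))

def gauss(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

lam, mu = 0.5, 1.0                       # initial parameter guesses
for it in range(100):
    # E-step: responsibilities = posterior probability that each sample came
    # from the shifted component, given the current (lam, mu).
    num = lam * gauss(x, mu, s2)
    r = num / (num + (1 - lam) * gauss(x, 0.0, s2))
    # M-step: maximize the expected complete log-likelihood.
    lam = r.mean()
    mu = (r * x).sum() / r.sum()

print(f"lam = {lam:.3f} (true {lam_true}), mu = {mu:.3f} (true {mu_true})")
```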

104 Back to the image restoration problem. We have a prior p(f|h, v, α) and an observation model p(g|f, σ²) = N(f, σ² I). We have unknown parameters α, σ², h, and v, so our complete set of unknowns is (f, α, σ², h, v). We need a hyper-prior p(α, σ², h, v). It makes sense to assume independence, p(α, σ², h, v) = p(α) p(σ²) p(h) p(v). We also choose p(α) = k_1 and p(σ²) = k_2, i.e., we will look for ML-type estimates of these parameters.

105 Reparametrization of the edge variables. A natural parametrization of the edge variables uses the locations of those that are equal to 1, which are usually a small minority. Let γ_h(k_h) and γ_v(k_v) be defined according to
  h_ij = 1 ⇔ (i, j) ∈ γ_h(k_h),   v_ij = 1 ⇔ (i, j) ∈ γ_v(k_v):
γ_h(k_h) contains the locations of the k_h variables h_ij that are set to 1, and similarly for γ_v(k_v) with respect to the v_ij's. Example: if h_2,5 = 1, h_6,2 = 1, v_3,4 = 1, v_5,7 = 1, and v_9,12 = 1, then k_h = 2, k_v = 3,
  γ_h(2) = [(2, 5), (6, 2)],   γ_v(3) = [(3, 4), (5, 7), (9, 12)].

106 Reparametrization of the edge variables. We have two unknown parameter vectors, γ_h(k_h) and γ_v(k_v), and these parameter vectors have unknown dimension: k_h = ?, k_v = ? We have a "model selection problem". This justifies another detour: model selection.

107 Returning to our image restoration problem. We have two parameter vectors, γ_h(k_h) and γ_v(k_v), of unknown dimension. The natural description lengths are
  L(γ_h(k_h)) = k_h (log M + log N),   L(γ_v(k_v)) = k_v (log M + log N),
where we are assuming the image size is M × N. With this MDL "prior" we can now estimate α, σ², γ_h(k_h), γ_v(k_v), and, most importantly, f.

108 Example: discontinuity-preserving restoration, using the MDL prior for the parameters and the POS criterion. (Figure: (a) noisy image; (b) discontinuity-preserving restoration; (c) signaled discontinuities; (d) restoration without preserving discontinuities.)

110 Generalized Gaussians (Bouman and Sauer [16]), '(x) = jxj p, with 2 [1 2] (for p = 2 ) GMRF). p Discontinuity-preserving restoration: convex potentials Stevenson et al. [80] proposed 8 < '(x) = x2 ( jxj < 2jxj ; 2 ( jxj < : The function proposed by Green [40], '(x) = 2 2 log cosh(x=). Approximately quadratic for small x linear, for large x. Parameter controls the transition between the two behaviors. Page 110

111 Discontinuity-preserving restoration: convex potentials Page 111

112 The one suggested by Hebert and Leahy [45] is = log ; 1 + (x=) 2. '(x) Discontinuity-preserving restoration: non-convex potentials Radically dierent from the quadratic: they atten, for large arguments. Blake and Zisserman's '(x) = (minfjxj g) 2 [15], [30] The one proposed by Geman and McClure [35]: '(x) = x 2 =(x ) Geman and Reynolds [36] proposed: '(x) = jxj=(jxj + ). Page 112
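For reference, a sketch collecting the potentials listed above as plain functions; δ denotes the scale/threshold parameter in each case, and the parameterizations follow the forms quoted here rather than the cited papers' original notation:

```python
import numpy as np

def quadratic(x):                # the GMRF potential (oversmooths edges)
    return x ** 2

def huber(x, d=1.0):             # Stevenson et al.: quadratic near 0, linear beyond d
    return np.where(np.abs(x) <= d, x ** 2, 2 * d * np.abs(x) - d ** 2)

def gen_gaussian(x, p=1.2):      # Bouman and Sauer: |x|^p, 1 <= p <= 2
    return np.abs(x) ** p

def log_cosh(x, d=1.0):          # Green: ~quadratic for small x, linear for large x
    return 2 * d ** 2 * np.log(np.cosh(x / d))

def blake_zisserman(x, d=1.0):   # non-convex: truncated quadratic
    return np.minimum(np.abs(x), d) ** 2

def geman_mcclure(x, d=1.0):     # non-convex: saturates for large |x|
    return x ** 2 / (d + x ** 2)

def geman_reynolds(x, d=1.0):    # non-convex: saturates for large |x|
    return np.abs(x) / (np.abs(x) + d)

def hebert_leahy(x, d=1.0):      # non-convex: logarithmic growth
    return np.log(1 + (x / d) ** 2)

x = np.linspace(-5, 5, 11)
print(np.round(huber(x), 2))
print(np.round(geman_mcclure(x), 2))
```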

113 Discontinuity-preserving restoration: non-convex potentials. (Figure: plots of the non-convex potential functions listed above.)

114 Optimization problems. By far the most common criterion in MRF applications is the MAP. This requires locating the mode(s) of the posterior:
  f̂_MAP = arg max_f p(f|g) = arg max_f (1/Z_p(g)) exp{ -U_p(f|g) } = arg min_f U_p(f|g),
where U_p(f|g) is called the a posteriori energy. Except in very particular cases (GMRF prior and Gaussian noise), there is no analytical solution. Finding a MAP estimate is then a difficult task.

115 Optimization problems: simulated annealing. Notice that U_p(f|g) can be multiplied by any positive constant 1/T without changing the maximizer:
  arg max_f p(f|g) = arg max_f (1/Z_p(T)) exp{ -U_p(f|g)/T } = arg max_f p(f|g, T).
By analogy with the Boltzmann distribution, T is called the temperature.
- As T → ∞, p(f|g, T) becomes flat: all configurations equiprobable.
- As T → 0, the probability concentrates on the set of maximizing configurations (denoted Ω_0). Formally,
  lim_{T→0} p(f|g, T) = 1/|Ω_0|, if f ∈ Ω_0;  0, if f ∉ Ω_0.

116 Optimization problems: simulated annealing. Simulated annealing (SA) exploits this behavior of p(f|g, T):
- simulate a system whose equilibrium distribution is p(f|g, T);
- "cool" it until the temperature reaches zero.
Implementation issues of SA:
- Question: how to simulate a system with equilibrium distribution p(f|g, T)? Answer: the Metropolis algorithm or the Gibbs sampler.
- Question: how to "cool it down" without destroying the equilibrium? Answer: later.

117 The Metropolis algorithm: simulating a system with equilibrium distribution p(f, T) ∝ exp{ -U(f)/T }.
- Starting state f(0).
- Given the current state f(t), a random "candidate" c is generated; let G_{f(t),c} be the probability of the candidate configuration c, given the current f(t).
- The candidate c is accepted with probability A_{f(t),c}(T); A(T) = [A_{f,c}(T)] is the acceptance matrix.
The new state f(t+1) only depends on f(t): this is a Markov chain.

118 The Metropolis algorithm (cont.). The usual choice for the acceptance probability is
  A_{f(t),c}(T) = min{ 1, exp[ (U(f(t)) - U(c)) / T ] }.
Under certain conditions on the candidate-generation probabilities G and on the acceptance matrix A(T) (invariance of p(·, T) under the resulting transition probabilities), the equilibrium distribution of the chain is
  p(f, T) ∝ exp{ -U(f)/T }.

119 The Gibbs sampler. Replaces the generation/acceptance mechanism by a simpler one, exploiting the Markovianity of p(f).
- Current state f(t); choose a site (i.e., an element of f(t)), say f_i(t).
- Generate f(t+1) by replacing f_i(t) by a random sample of its conditional probability, with respect to p(f, T). All other elements are unchanged.
If every site is visited infinitely often, the equilibrium distribution is again p(f, T) ∝ exp{ -U(f)/T }.

120 Simulated annealing: cooling. The temperature evolves according to T(t), called the "cooling schedule". The cooling schedule must verify
  Σ_{t=1}^∞ exp{ -K/T(t) } = ∞,
where K is a problem-dependent constant. Best known case: T(t) = C / log(t + 1), with C ≥ K.

121 Iterated conditional modes (ICM) algorithm. It is a Gibbs sampler at zero temperature: the visited site is replaced by the maximizer of its conditional, given the current state of its neighbors. Advantage: extremely fast. Disadvantage: convergence to a local maximum. Sometimes, this may not really be a disadvantage.

122 Implementing the PM and MPM criteria. Recall the maximizer of posterior marginals (MPM),
  f̂_MPM = [ arg max_{f_1} p(f_1|g), ..., arg max_{f_m} p(f_m|g) ]^T,
and the posterior mean (PM), f̂_PM = E[f|g]. Simply simulate (i.e., sample from) p(f|g) using the Gibbs sampler or the Metropolis algorithm, and collect statistics:
- for the PM, site-wise averages approximate the PM estimate;
- for the MPM, collect site-wise histograms; these histograms are estimates of the marginals p(f_i|g), from which the MPM is easily obtained.
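A sketch of this recipe for a small Gaussian denoising posterior (smoothing GMRF prior, identity observation operator): a Gibbs sampler repeatedly resamples each site from its Gaussian local conditional, and site-wise averages of the post-burn-in samples approximate the PM estimate. All settings are illustrative:

```python
import numpy as np

rng = np.random.default_rng(8)

M, N = 16, 16
f_true = np.zeros((M, N))
f_true[:, N // 2:] = 2.0                       # a simple two-level image

sigma2, alpha = 0.25, 2.0                      # noise variance, prior weight
g = f_true + rng.normal(0.0, np.sqrt(sigma2), size=(M, N))

# Posterior energy: ||f - g||^2/(2 sigma2) + (alpha/2) * sum over adjacent pairs (f_p - f_q)^2.
f = g.copy()
n_burn, n_samples = 100, 400
accum = np.zeros_like(f)

for sweep in range(n_burn + n_samples):
    for i in range(M):
        for j in range(N):
            nb = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
            nb = [(a, b) for a, b in nb if 0 <= a < M and 0 <= b < N]
            s = sum(f[a, b] for a, b in nb)
            # Local conditional is Gaussian (GMRF): precision and mean below.
            prec = 1.0 / sigma2 + alpha * len(nb)
            mean = (g[i, j] / sigma2 + alpha * s) / prec
            f[i, j] = rng.normal(mean, np.sqrt(1.0 / prec))
    if sweep >= n_burn:
        accum += f

f_pm = accum / n_samples                       # site-wise averages ~ PM estimate
print("noisy RMSE:", np.sqrt(np.mean((g - f_true) ** 2)))
print("PM    RMSE:", np.sqrt(np.mean((f_pm - f_true) ** 2)))
```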

123 The partition function problem. MRFs are plagued by the difficulty of computing the partition function. This is especially true for parameter estimation. Few exceptions: GMRFs and Ising fields. This issue is dealt with by applying approximation techniques.

124 Approximating the partition function: pseudo-likelihood. Besag's pseudo-likelihood approximation:
  p(f) ≈ Π_{i ∈ N} p(f_i | f_N(i)).
This approximation was used in the example shown above on discontinuity-preserving restoration with CGMRFs and MDL priors.

125 Approximating the partition function: mean field. Imported from statistical physics. The exact function is approximated by a factored version:
  p(f) = exp{ -U(f) } / Z ≈ Π_{i ∈ N} exp{ -U_i^MF(f_i) } / Z_i^MF.
The quantity U_i^MF(f_i) is the mean field local energy,
  U_i^MF(f_i) = Σ_{C: i ∈ C} V_C( f_i, { E^MF[f_k] : k ∈ C } ),
where
  E^MF[f_k] = Σ_{f_k} f_k exp{ -U_k^MF(f_k) } / Z_k^MF.
We replace the neighbors of each site by their (frozen) means.

126 Mean field approximation (cont.). There is a self-referential aspect in the previous equations:
- to obtain U_i^MF(f_i) we need the mean values of its neighbors;
- these, in turn, depend on E^MF[f_i] (since neighborhood relations are symmetrical), and thus on U_i^MF(f_i) itself.
As a consequence, the MF approximation has to be obtained iteratively. Alternative: the mean of each site, E^MF[f_i], is approximated by the mode -- the saddle point approximation.

127 Deterministic optimization: continuation methods. The objective function U(f|g) is embedded in a family
  { U(f|g, λ), λ ∈ [0, 1] }
such that U(f|g, 0) is easily minimizable and U(f|g, 1) = U(f|g). Procedure:
- find the minimum of U(f|g, 0); this is easy;
- track that minimum while λ (slowly) increases up to 1.

128 Deterministic optimization: continuation methods. The "tracking" is usually implemented as follows:
- a discrete set of values { 0 = λ_0 ≤ λ_1 ≤ ... ≤ λ_t ≤ ... ≤ λ_{n-1} ≤ λ_n = 1 } ⊂ [0, 1] is chosen;
- for each λ_t, U(f|g, λ_t) is minimized by some local iterative technique;
- this iterative process is initialized at the minimum previously obtained for λ_{t-1}.
Writing T = -log λ reveals that simulated annealing shares some of the spirit of continuation algorithms; simulated annealing can be called a "stochastic continuation method".

129 Continuation methods: mean field annealing (MFA). MFA is a deterministic surrogate of (stochastic) simulated annealing: p(f|g, T) is replaced by its MF approximation. Computing the MF approximation means finding the MF values; the fact that these must be obtained iteratively is exploited to insert their computation into a continuation method.

130 Continuation methods: mean field annealing (MFA).
- For T → ∞, p(f|g, T) and its MF approximation are uniform; the mean field is trivially obtainable.
- At (finite) temperature T_t, the mean field values E_t^MF[f_k | g, T_t] are obtained iteratively; this iterative process is initialized at the previous mean field values E_{t-1}^MF[f_k | g, T_{t-1}].
- As T(t) → 0, the MF approximation converges to a distribution concentrated on its global maxima.
- Alternatively, the temperature descent is stopped at T = 1. This yields a MF approximation of p(f|g, T = 1) = p(f|g), and the mean field values are then (approximate) PM estimates.

131 Continuation methods: simulated tearing (ST). Uses the following family of functions:
  { U(f|g, λ) = U(f | λ g), λ ∈ [0, 1] }.
Obviously, U(f|g, 1) = U(f|g). This method is adequate when U(f|g, 0) is easily minimizable. This is the case for most discontinuity-preserving MRF priors, because for arguments close to zero the potentials have convex behavior. The example shown above, of discontinuity-preserving restoration with CGMRF and MDL priors, uses this continuation method.

132 Important topics not covered:
- Specific discrete-state MRFs: Ising, auto-logistic, auto-binomial, ...
- Multi-scale MRF models.
- Causal MRF models.
- A closer look at applications (see references).

133 Some references (this is not an exhaustive list):
- Fundamental Bayesian theory: see the accompanying text and the many references therein.
- Compound inference. General concepts: [7], [74]. In computer vision/image analysis/pattern recognition: [4], [44], [65]. The multivariate Gaussian case, from a signal processing perspective: [76].
- Random fields on graphs: [41], [42], [71], [79] (and references therein), and [81].
- Markov random fields on regular graphs. Seminal papers: [33], [11]. Earlier work: [82], [75], [85], [86], [9], [52]. Books (good sources for further references): [19], [62], [83], [42]; see also [41]. Influential papers on MRFs for image analysis and computer vision: [11], [18], [20], [21], [23], [24], [25], [26], [33], [49], [61], [65], [68], [85], [86].
- Compound Gauss-Markov random fields and applications: [28], [48], [49], [90].
- Parameter estimation: [6], [55], [28], [39], [50], [56], [64], [67], [84], [91], [94], [5], and further references in [62].


More information

Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ and Center for Automated Learning and

More information

Markov Networks. l Like Bayes Nets. l Graph model that describes joint probability distribution using tables (AKA potentials)

Markov Networks. l Like Bayes Nets. l Graph model that describes joint probability distribution using tables (AKA potentials) Markov Networks l Like Bayes Nets l Graph model that describes joint probability distribution using tables (AKA potentials) l Nodes are random variables l Labels are outcomes over the variables Markov

More information

Markov Random Fields

Markov Random Fields Markov Random Fields 1. Markov property The Markov property of a stochastic sequence {X n } n 0 implies that for all n 1, X n is independent of (X k : k / {n 1, n, n + 1}), given (X n 1, X n+1 ). Another

More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information

3 : Representation of Undirected GM

3 : Representation of Undirected GM 10-708: Probabilistic Graphical Models 10-708, Spring 2016 3 : Representation of Undirected GM Lecturer: Eric P. Xing Scribes: Longqi Cai, Man-Chia Chang 1 MRF vs BN There are two types of graphical models:

More information

CS 630 Basic Probability and Information Theory. Tim Campbell

CS 630 Basic Probability and Information Theory. Tim Campbell CS 630 Basic Probability and Information Theory Tim Campbell 21 January 2003 Probability Theory Probability Theory is the study of how best to predict outcomes of events. An experiment (or trial or event)

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Inversion Base Height. Daggot Pressure Gradient Visibility (miles)

Inversion Base Height. Daggot Pressure Gradient Visibility (miles) Stanford University June 2, 1998 Bayesian Backtting: 1 Bayesian Backtting Trevor Hastie Stanford University Rob Tibshirani University of Toronto Email: trevor@stat.stanford.edu Ftp: stat.stanford.edu:

More information

Bayesian Image Segmentation Using MRF s Combined with Hierarchical Prior Models

Bayesian Image Segmentation Using MRF s Combined with Hierarchical Prior Models Bayesian Image Segmentation Using MRF s Combined with Hierarchical Prior Models Kohta Aoki 1 and Hiroshi Nagahashi 2 1 Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology

More information

PROBABILITY DISTRIBUTIONS. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception

PROBABILITY DISTRIBUTIONS. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception PROBABILITY DISTRIBUTIONS Credits 2 These slides were sourced and/or modified from: Christopher Bishop, Microsoft UK Parametric Distributions 3 Basic building blocks: Need to determine given Representation:

More information

ECE521 Tutorial 11. Topic Review. ECE521 Winter Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides. ECE521 Tutorial 11 / 4

ECE521 Tutorial 11. Topic Review. ECE521 Winter Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides. ECE521 Tutorial 11 / 4 ECE52 Tutorial Topic Review ECE52 Winter 206 Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides ECE52 Tutorial ECE52 Winter 206 Credits to Alireza / 4 Outline K-means, PCA 2 Bayesian

More information

Part 8: GLMs and Hierarchical LMs and GLMs

Part 8: GLMs and Hierarchical LMs and GLMs Part 8: GLMs and Hierarchical LMs and GLMs 1 Example: Song sparrow reproductive success Arcese et al., (1992) provide data on a sample from a population of 52 female song sparrows studied over the course

More information

Intelligent Systems:

Intelligent Systems: Intelligent Systems: Undirected Graphical models (Factor Graphs) (2 lectures) Carsten Rother 15/01/2015 Intelligent Systems: Probabilistic Inference in DGM and UGM Roadmap for next two lectures Definition

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish

More information

Machine Learning using Bayesian Approaches

Machine Learning using Bayesian Approaches Machine Learning using Bayesian Approaches Sargur N. Srihari University at Buffalo, State University of New York 1 Outline 1. Progress in ML and PR 2. Fully Bayesian Approach 1. Probability theory Bayes

More information

3 Undirected Graphical Models

3 Undirected Graphical Models Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 3 Undirected Graphical Models In this lecture, we discuss undirected

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

A graph contains a set of nodes (vertices) connected by links (edges or arcs) BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Introduction to Machine Learning. Lecture 2

Introduction to Machine Learning. Lecture 2 Introduction to Machine Learning Lecturer: Eran Halperin Lecture 2 Fall Semester Scribe: Yishay Mansour Some of the material was not presented in class (and is marked with a side line) and is given for

More information

SCUOLA DI SPECIALIZZAZIONE IN FISICA MEDICA. Sistemi di Elaborazione dell Informazione. Regressione. Ruggero Donida Labati

SCUOLA DI SPECIALIZZAZIONE IN FISICA MEDICA. Sistemi di Elaborazione dell Informazione. Regressione. Ruggero Donida Labati SCUOLA DI SPECIALIZZAZIONE IN FISICA MEDICA Sistemi di Elaborazione dell Informazione Regressione Ruggero Donida Labati Dipartimento di Informatica via Bramante 65, 26013 Crema (CR), Italy http://homes.di.unimi.it/donida

More information

CSC 412 (Lecture 4): Undirected Graphical Models

CSC 412 (Lecture 4): Undirected Graphical Models CSC 412 (Lecture 4): Undirected Graphical Models Raquel Urtasun University of Toronto Feb 2, 2016 R Urtasun (UofT) CSC 412 Feb 2, 2016 1 / 37 Today Undirected Graphical Models: Semantics of the graph:

More information

Outline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution

Outline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution Outline A short review on Bayesian analysis. Binomial, Multinomial, Normal, Beta, Dirichlet Posterior mean, MAP, credible interval, posterior distribution Gibbs sampling Revisit the Gaussian mixture model

More information

Markov Chains and MCMC

Markov Chains and MCMC Markov Chains and MCMC Markov chains Let S = {1, 2,..., N} be a finite set consisting of N states. A Markov chain Y 0, Y 1, Y 2,... is a sequence of random variables, with Y t S for all points in time

More information

Curve Fitting Re-visited, Bishop1.2.5

Curve Fitting Re-visited, Bishop1.2.5 Curve Fitting Re-visited, Bishop1.2.5 Maximum Likelihood Bishop 1.2.5 Model Likelihood differentiation p(t x, w, β) = Maximum Likelihood N N ( t n y(x n, w), β 1). (1.61) n=1 As we did in the case of the

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

Statistics 352: Spatial statistics. Jonathan Taylor. Department of Statistics. Models for discrete data. Stanford University.

Statistics 352: Spatial statistics. Jonathan Taylor. Department of Statistics. Models for discrete data. Stanford University. 352: 352: Models for discrete data April 28, 2009 1 / 33 Models for discrete data 352: Outline Dependent discrete data. Image data (binary). Ising models. Simulation: Gibbs sampling. Denoising. 2 / 33

More information

Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling

Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling Jon Wakefield Departments of Statistics and Biostatistics University of Washington 1 / 37 Lecture Content Motivation

More information

p L yi z n m x N n xi

p L yi z n m x N n xi y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

More information

Ch 4. Linear Models for Classification

Ch 4. Linear Models for Classification Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm 1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable

More information

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007 Gaussian Processes Outline Gaussian Processes Outline Parametric Bayesian Regression Gaussian

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

Independencies. Undirected Graphical Models 2: Independencies. Independencies (Markov networks) Independencies (Bayesian Networks)

Independencies. Undirected Graphical Models 2: Independencies. Independencies (Markov networks) Independencies (Bayesian Networks) (Bayesian Networks) Undirected Graphical Models 2: Use d-separation to read off independencies in a Bayesian network Takes a bit of effort! 1 2 (Markov networks) Use separation to determine independencies

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters

More information

Hierarchical prior models for scalable Bayesian Tomography

Hierarchical prior models for scalable Bayesian Tomography A. Mohammad-Djafari, Hierarchical prior models for scalable Bayesian Tomography, WSTL, January 25-27, 2016, Szeged, Hungary. 1/45. Hierarchical prior models for scalable Bayesian Tomography Ali Mohammad-Djafari

More information

Introduction to Probabilistic Graphical Models

Introduction to Probabilistic Graphical Models Introduction to Probabilistic Graphical Models Sargur Srihari srihari@cedar.buffalo.edu 1 Topics 1. What are probabilistic graphical models (PGMs) 2. Use of PGMs Engineering and AI 3. Directionality in

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida Bayesian Statistical Methods Jeff Gill Department of Political Science, University of Florida 234 Anderson Hall, PO Box 117325, Gainesville, FL 32611-7325 Voice: 352-392-0262x272, Fax: 352-392-8127, Email:

More information

DS-GA 1002 Lecture notes 11 Fall Bayesian statistics

DS-GA 1002 Lecture notes 11 Fall Bayesian statistics DS-GA 100 Lecture notes 11 Fall 016 Bayesian statistics In the frequentist paradigm we model the data as realizations from a distribution that depends on deterministic parameters. In contrast, in Bayesian

More information

1 Undirected Graphical Models. 2 Markov Random Fields (MRFs)

1 Undirected Graphical Models. 2 Markov Random Fields (MRFs) Machine Learning (ML, F16) Lecture#07 (Thursday Nov. 3rd) Lecturer: Byron Boots Undirected Graphical Models 1 Undirected Graphical Models In the previous lecture, we discussed directed graphical models.

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework

More information

Learning Bayesian network : Given structure and completely observed data

Learning Bayesian network : Given structure and completely observed data Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution

More information

Undirected Graphical Models

Undirected Graphical Models Undirected Graphical Models 1 Conditional Independence Graphs Let G = (V, E) be an undirected graph with vertex set V and edge set E, and let A, B, and C be subsets of vertices. We say that C separates

More information

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj Adriana Ibrahim Institute

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Bayesian Learning. Two Roles for Bayesian Methods. Bayes Theorem. Choosing Hypotheses

Bayesian Learning. Two Roles for Bayesian Methods. Bayes Theorem. Choosing Hypotheses Bayesian Learning Two Roles for Bayesian Methods Probabilistic approach to inference. Quantities of interest are governed by prob. dist. and optimal decisions can be made by reasoning about these prob.

More information

Some Asymptotic Bayesian Inference (background to Chapter 2 of Tanner s book)

Some Asymptotic Bayesian Inference (background to Chapter 2 of Tanner s book) Some Asymptotic Bayesian Inference (background to Chapter 2 of Tanner s book) Principal Topics The approach to certainty with increasing evidence. The approach to consensus for several agents, with increasing

More information

CS 2750: Machine Learning. Bayesian Networks. Prof. Adriana Kovashka University of Pittsburgh March 14, 2016

CS 2750: Machine Learning. Bayesian Networks. Prof. Adriana Kovashka University of Pittsburgh March 14, 2016 CS 2750: Machine Learning Bayesian Networks Prof. Adriana Kovashka University of Pittsburgh March 14, 2016 Plan for today and next week Today and next time: Bayesian networks (Bishop Sec. 8.1) Conditional

More information

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that

More information

Probability and Estimation. Alan Moses

Probability and Estimation. Alan Moses Probability and Estimation Alan Moses Random variables and probability A random variable is like a variable in algebra (e.g., y=e x ), but where at least part of the variability is taken to be stochastic.

More information

Hidden Markov Models and Gaussian Mixture Models

Hidden Markov Models and Gaussian Mixture Models Hidden Markov Models and Gaussian Mixture Models Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 4&5 23&27 January 2014 ASR Lectures 4&5 Hidden Markov Models and Gaussian

More information

Study Notes on the Latent Dirichlet Allocation

Study Notes on the Latent Dirichlet Allocation Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection

More information

Bayesian Paradigm. Maximum A Posteriori Estimation

Bayesian Paradigm. Maximum A Posteriori Estimation Bayesian Paradigm Maximum A Posteriori Estimation Simple acquisition model noise + degradation Constraint minimization or Equivalent formulation Constraint minimization Lagrangian (unconstraint minimization)

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Directed and Undirected Graphical Models

Directed and Undirected Graphical Models Directed and Undirected Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Machine Learning: Neural Networks and Advanced Models (AA2) Last Lecture Refresher Lecture Plan Directed

More information

Classical and Bayesian inference

Classical and Bayesian inference Classical and Bayesian inference AMS 132 Claudia Wehrhahn (UCSC) Classical and Bayesian inference January 8 1 / 11 The Prior Distribution Definition Suppose that one has a statistical model with parameter

More information

CSCE 478/878 Lecture 6: Bayesian Learning

CSCE 478/878 Lecture 6: Bayesian Learning Bayesian Methods Not all hypotheses are created equal (even if they are all consistent with the training data) Outline CSCE 478/878 Lecture 6: Bayesian Learning Stephen D. Scott (Adapted from Tom Mitchell

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 143 Part IV

More information

Advances and Applications in Perfect Sampling

Advances and Applications in Perfect Sampling and Applications in Perfect Sampling Ph.D. Dissertation Defense Ulrike Schneider advisor: Jem Corcoran May 8, 2003 Department of Applied Mathematics University of Colorado Outline Introduction (1) MCMC

More information

Machine Learning Lecture 5

Machine Learning Lecture 5 Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

More information