
1 Bayesian Methods and Markov Random Fields
Department of Electrical and Computer Engineering, Instituto Superior Técnico, Lisboa, PORTUGAL
Thanks: Anil K. Jain and Robert D. Nowak, Michigan State University, USA

2 Most image analysis problems are "inference" problems: from an observed image g, infer an image f ("inference": g → f). For example, "edge detection": from the observed image, infer the edge map.

3 The word "image" should be understood in a wide sense. Examples: conventional image, CT image, flow image, range image.

4 Examples of "inference" problems: edge detection, image restoration, template matching, contour estimation.

5 Main features of "image analysis" problems:
- They are inference problems, i.e., they can be formulated as: "from g, infer f".
- They cannot be solved without using a priori knowledge.
- Both f and g are high-dimensional (e.g., images).
- They are naturally formulated as statistical inference problems.

6 Introduction to Bayesian theory. Basically, the Bayesian approach provides a way to "invert" an observation model, taking prior knowledge into account: the unknown f passes through the observation model to produce the observed g; prior knowledge about f, the observation model, and a loss function feed a Bayesian decision, producing the inferred f̂. Inferred = estimated, or detected, or classified, ...

7 The Bayesian philosophy: knowledge ↔ probability. A subjective (non-frequentist) interpretation of probability: probabilities express "degrees of belief". Example: "there is a 20% probability that a certain patient has a tumor". Since we are considering one particular patient, this statement has no frequential meaning; it expresses a degree of belief. It can be shown that probability theory is the right tool to formally deal with "degrees of belief" or "knowledge": Cox (46), Savage (54), Good (60), Jeffreys (39, 61), Jaynes (63, 68, 91).

8 Bayesian decision theory. Ingredients: knowledge about f (the prior p(f)), the observation model p(g|f), and a loss function L(f, f̂). Bayesian decision theory produces a decision rule f̂ = δ(g) mapping the observed data g to the inferred quantity f̂ -- an "algorithm".

9 How are Bayesian decision rules derived? By applying the fundamental principles of the Bayesian philosophy:
- Knowledge is expressed via probability functions.
- The "conditionality principle": any inference must be based on (conditioned on) the observed data g.
- The "likelihood principle": the information contained in the observation g can only be carried via the likelihood function p(g|f).
Accordingly, knowledge about f, once g is observed, is expressed by the a posteriori (or posterior) probability function:
  p(f|g) = p(f) p(g|f) / p(g)    ("Bayes law")

10 How are Bayesian decision rules derived? (cont.) Once g is observed, knowledge about f is expressed by p(f|g). Given g, the expected value of the loss function L(f, f̂) is
  E[ L(f, f̂) | g ] = ∫ L(f, f̂) p(f|g) df,
...the so-called "a posteriori expected loss". An "optimal Bayes rule" is one minimizing it:
  f̂_Bayes = δ_Bayes(g) = arg min_{f̂} ∫ L(f, f̂) p(f|g) df.

11 The "Bayesian image processor": the prior p(f) and the likelihood p(g|f) are combined by Bayes law into p(f|g) = p(f) p(g|f) / p(g); together with the loss function L(f, f̂), the a posteriori expected loss E[L(f, f̂)|g] is minimized, yielding the decision rule f̂ = δ(g). ("Pure Bayesians stop here: report the posterior!")

12 More on Bayes law, p(f|g) = p(f) p(g|f) / p(g). The numerator is the joint probability of f and g: p(g|f) p(f) = p(g, f). The denominator is simply a normalizing constant,
  p(g) = ∫ p(g, f) df = ∫ p(g|f) p(f) df,
...it is a marginal probability function. Other names: unconditional, predictive, evidence. In discrete cases, rather than an integral we have a summation.

13 The "0/1" loss function. For a scalar continuous f ∈ F, e.g., F = R,
  L_ε(f, f̂) = 1, if |f - f̂| ≥ ε;  0, if |f - f̂| < ε.
(Figure: the "0/1" loss function of width ε, as a function of f around δ(g).)

14 The "0/1" loss function (cont.) Minimizing the "a posteriori" expected loss:
  δ_ε(g) = arg min_d ∫_F L_ε(f, d) p(f|g) df
         = arg min_d ∫_{f: |f-d| ≥ ε} p(f|g) df
         = arg min_d [ 1 - ∫_{f: |f-d| < ε} p(f|g) df ]
         = arg max_d ∫_{d-ε}^{d+ε} p(f|g) df.

15 Letting ε approach zero,
  f̂_MAP = δ_MAP(g) = lim_{ε→0} δ_ε(g) = lim_{ε→0} arg max_d ∫_{d-ε}^{d+ε} p(f|g) df = arg max_f p(f|g),
...called the "maximum a posteriori" (MAP) estimator. With ε decreasing, δ_ε(g) "looks for" the highest mode of p(f|g).

16 The "0/1" loss for a scalar discrete f ∈ F:
  L(f, f̂) = 1, if f ≠ f̂;  0, if f = f̂.
Again, minimizing the "a posteriori" expected loss:
  δ(g) = arg min_d Σ_{f ∈ F} L(f, d) p(f|g) = arg min_d Σ_{f ≠ d} p(f|g) = arg min_d [ 1 - p(d|g) ] = arg max_f p(f|g) = f̂_MAP,
...the "maximum a posteriori" (MAP) classifier/detector.

17 "Quadratic" loss function. For a scalar continuous f ∈ F, e.g., F = R, L(f, f̂) = (f - f̂)². Minimizing the a posteriori expected loss,
  δ_PM(g) = arg min_d E[ (f - d)² | g ] = arg min_d { E[f²|g] + d² - 2 d E[f|g] }
(the first term is constant), which gives f̂_PM = E[f|g],
...the "posterior mean" (PM) estimator.

18 Example: Gaussian observations with a Gaussian prior. The observation model is
  p(g|f) = p([g_1 g_2 ... g_n]^T | f) = N([f f ... f]^T, σ² I) = (2πσ²)^(-n/2) exp{ -(1/(2σ²)) Σ_{i=1}^n (g_i - f)² },
where I denotes an identity matrix. The prior is
  p(f) = N(0, φ²) = (2πφ²)^(-1/2) exp{ -f²/(2φ²) }.
From these two models, the posterior is simply
  p(f|g) = N( n φ² ḡ / (σ² + n φ²), (n/σ² + 1/φ²)^(-1) ),  with ḡ = (g_1 + ... + g_n)/n.

19 Example: Gaussian observations with a Gaussian prior (cont.). As seen in the previous slide, p(f|g) = N( n φ² ḡ / (σ² + n φ²), (n/σ² + 1/φ²)^(-1) ). Then, since the mean and the mode of a Gaussian coincide,
  f̂_MAP = f̂_PM = n φ² ḡ / (σ² + n φ²),
i.e., the estimate is a "shrunken" version of the sample mean ḡ. If the prior had mean μ, we would have
  f̂_MAP = f̂_PM = (σ² μ + n φ² ḡ) / (σ² + n φ²),
i.e., the estimate is a weighted average of μ and ḡ.

20 Example: Gaussian observations with a Gaussian prior (cont.). Observe that
  lim_{n→∞} n φ² ḡ / (σ² + n φ²) = ḡ,
i.e., as n increases, the data dominates the estimate. The posterior variance does not depend on g,
  E[ (f - f̂)² | g ] = (n/σ² + 1/φ²)^(-1),
and is inversely proportional to the degree of confidence in the estimate. Notice also that
  lim_{n→∞} (n/σ² + 1/φ²)^(-1) = 0,
...as n → ∞ the confidence in the estimate becomes absolute.
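A minimal numerical sketch of this shrinkage behaviour, assuming NumPy; the true value, variances, and sample size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

f_true = 1.5          # unknown scalar to be estimated
sigma2 = 1.0          # observation noise variance
phi2 = 0.1            # prior variance (prior mean is zero)

n = 20
g = f_true + rng.normal(0.0, np.sqrt(sigma2), size=n)
g_bar = g.mean()

# Posterior is Gaussian; MAP and PM coincide (slides 18-20).
post_var = 1.0 / (n / sigma2 + 1.0 / phi2)
f_hat = (n * phi2 * g_bar) / (sigma2 + n * phi2)   # shrunken sample mean

print(f"sample mean       : {g_bar:.3f}")
print(f"MAP = PM estimate : {f_hat:.3f}")
print(f"posterior variance: {post_var:.4f}")
```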

21 (Figure: top panel, "Estimates" -- the sample mean, the MAP estimate with φ² = 0.1, and the MAP estimate with φ² = 0.01, versus the number of observations; bottom panel, "Variances" -- the a posteriori variance with φ² = 0.1 and with φ² = 0.01, versus the number of observations.)

22 Example: Gaussian mixture observations with a Gaussian prior. "Mixture" observation model:
  p(g|s) = λ (2πσ²)^(-1/2) exp{ -(g - s - μ)²/(2σ²) } + (1 - λ) (2πσ²)^(-1/2) exp{ -(g - s)²/(2σ²) },
i.e., g = s + θ + n, with n ~ N(0, σ²) and θ = μ with probability λ, θ = 0 with probability (1 - λ). Gaussian prior: p(s) = N(0, φ²).

23 Example: Gaussian mixture observations with a Gaussian prior (cont.). The posterior:
  p(s|g) ∝ λ exp{ -(g - s - μ)²/(2σ²) - s²/(2φ²) } + (1 - λ) exp{ -(g - s)²/(2σ²) - s²/(2φ²) }.
(Figure: p_S(s|g = 0.5) for λ = 0.6, μ = 4, φ² = 0.5; the posterior is bimodal.) The PM is a "compromise" between the two modes; the MAP picks the largest mode.
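A small sketch that evaluates this bimodal posterior on a grid and contrasts the MAP (largest mode) with the PM (a compromise between the modes); the parameter values mirror the example above but are otherwise illustrative:

```python
import numpy as np

lam, mu, sigma2, phi2 = 0.6, 4.0, 1.0, 0.5   # illustrative values
g = 0.5                                      # a single observation

s = np.linspace(-5.0, 10.0, 20001)           # grid over the unknown s
ds = s[1] - s[0]

# Unnormalized posterior p(s|g), as on the slide.
post = (lam * np.exp(-(g - s - mu) ** 2 / (2 * sigma2) - s ** 2 / (2 * phi2))
        + (1 - lam) * np.exp(-(g - s) ** 2 / (2 * sigma2) - s ** 2 / (2 * phi2)))
post /= post.sum() * ds                      # normalize numerically

s_map = s[np.argmax(post)]                   # MAP: largest mode
s_pm = (s * post).sum() * ds                 # PM: posterior mean

print(f"MAP estimate: {s_map:.3f}")
print(f"PM  estimate: {s_pm:.3f}")
```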

24 Improper priors and "maximum likelihood" inference. Recall that the posterior is computed according to p(f|g) = p(f) p(g|f) / p(g). If the MAP criterion is being used, and p(f) = k (a flat prior),
  f̂_MAP = arg max_f [ k p(g|f) / ∫ k p(g|f) df ] = arg max_f p(g|f),
...the "maximum likelihood" (ML) estimate. In the discrete case, simply replace the integral by a summation.

25 Improper priors and maximum likelihood inference (cont.). If the space to which f belongs is unbounded, e.g., f ∈ R^m or f ∈ N, the flat prior is "improper":
  ∫ p(f) df = ∫ k df = ∞,   or   Σ_f p(f) = Σ_f k = ∞.
If the posterior is proper, all the estimates are still well defined. Improper priors reinforce the "knowledge" interpretation of probabilities.

26 Compound inference: inferring a set of unknowns. Now f is a (say, m-dimensional) vector, f = [f_1 f_2 ... f_m]^T. Loss functions for compound problems:
- Additive: such that L(f, f̂) = Σ_{i=1}^m L_i(f_i, f̂_i).
- Non-additive: this decomposition does not exist.
Optimal Bayes rules are still
  f̂_Bayes = δ_Bayes(g) = arg min_{f̂} ∫ L(f, f̂) p(f|g) df.

27 Compound inference with non-additive loss functions. There is nothing fundamentally new in this case. The "0/1" loss, for a vector f (e.g., F = R^m):
  L_ε(f, f̂) = 1, if ||f - f̂|| ≥ ε;  0, if ||f - f̂|| < ε.
Following the same derivation yields
  f̂_MAP = δ_MAP(g) = arg max_f p(f|g),
i.e., the MAP estimate is the joint mode of the a posteriori probability function. Exactly the same expression is obtained for discrete problems.

28 Compound inference with non-additive loss functions (cont.). The quadratic loss, for f ∈ R^m: L(f, f̂) = (f - f̂)^T Q (f - f̂), where Q is a symmetric positive-definite (m × m) matrix. Minimizing the a posteriori expected loss,
  δ_PM(g) = arg min_{f̂} E[ (f - f̂)^T Q (f - f̂) | g ] = arg min_{f̂} { E[f^T Q f | g] + f̂^T Q f̂ - 2 f̂^T Q E[f|g] }
(the first term is constant) = solution of Q f̂ = Q E[f|g] (Q has an inverse), so
  f̂_PM = E[f|g],
...still the "posterior mean" (PM) estimator. Remarkably, this is true for any symmetric positive-definite Q. Special case: if Q is diagonal, the loss function is additive.

29 Compound inference with additive loss functions. Recall that, in this case, L(f, f̂) = Σ_{i=1}^m L_i(f_i, f̂_i). The optimal Bayes rule:
  δ_Bayes(g) = arg min_{f̂} ∫ Σ_{i=1}^m L_i(f_i, f̂_i) p(f|g) df
             = arg min_{f̂} Σ_{i=1}^m ∫ L_i(f_i, f̂_i) p(f|g) df
             = arg min_{f̂} Σ_{i=1}^m ∫ L_i(f_i, f̂_i) [ ∫ p(f|g) df_{-i} ] df_i,
where df_{-i} denotes df_1 ... df_{i-1} df_{i+1} ... df_m, that is, integration with respect to all variables except f_i.

30 Compound inference with additive loss functions (cont.). But ∫ p(f|g) df_{-i} = p(f_i|g), the a posteriori marginal of variable f_i. Then,
  δ_Bayes(g) = arg min_{f̂} Σ_{i=1}^m ∫ L_i(f_i, f̂_i) p(f_i|g) df_i,
that is,
  f̂_i,Bayes = arg min_{f̂_i} ∫ L_i(f_i, f̂_i) p(f_i|g) df_i,   i = 1, 2, ..., m.
Conclusion: each estimate is the minimizer of the corresponding marginal a posteriori expected loss.

31 Additive loss functions: special cases.
- The additive "0/1" loss function: L(f, f̂) = Σ_{i=1}^m L_i(f_i, f̂_i), where each L_i(f_i, f̂_i) is a "0/1" loss function for scalar arguments. According to the general result,
  f̂_MPM = [ arg max_{f_1} p(f_1|g), arg max_{f_2} p(f_2|g), ..., arg max_{f_m} p(f_m|g) ]^T,
the maximizer of posterior marginals (MPM).
- The additive quadratic loss function: L(f, f̂) = Σ_{i=1}^m (f_i - f̂_i)² = (f - f̂)^T (f - f̂). The general result for quadratic loss functions is still valid. This is a natural fact, because the mean is intrinsically marginal.

32 Example: Gaussian observations and Gaussian prior. Observation model: linear operator (matrix) plus additive white Gaussian noise,
  g = H f + n,  where n ~ N(0, σ² I).
Corresponding likelihood function:
  p(g|f) = (2πσ²)^(-n/2) exp{ -||H f - g||² / (2σ²) }.
Gaussian prior:
  p(f) = (2π)^(-n/2) det(K)^(-1/2) exp{ -(1/2) f^T K^(-1) f }.

33 Example: Gaussian observations and Gaussian prior (cont.). The a posteriori (joint) probability density function is still Gaussian, p(f|g) = N(f̂, P), with f̂ being both the MAP and the PM estimate, given by
  f̂ = arg min_f { f^T (σ² K^(-1) + H^T H) f - 2 f^T H^T g } = (σ² K^(-1) + H^T H)^(-1) H^T g.
This is also called the (vector) Wiener filter.
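A small dense-matrix sketch of this estimator on a 1-D signal (the blur matrix H and the prior precision K^(-1) used here are illustrative choices; real image problems need the iterative schemes discussed later):

```python
import numpy as np

rng = np.random.default_rng(1)

m = 50                                     # signal length (1-D for simplicity)
f_true = np.cumsum(rng.normal(size=m))     # a smooth-ish "image"

# Illustrative observation operator: 3-tap moving-average blur.
H = np.zeros((m, m))
for i in range(m):
    for k in (-1, 0, 1):
        if 0 <= i + k < m:
            H[i, i + k] = 1.0 / 3.0

sigma2 = 0.05
g = H @ f_true + rng.normal(0.0, np.sqrt(sigma2), size=m)

# Illustrative prior precision K^{-1}: penalizes first-order differences.
D = np.eye(m) - np.eye(m, k=1)             # finite-difference operator
K_inv = D.T @ D + 1e-3 * np.eye(m)         # small ridge keeps it invertible

# MAP = PM estimate (the vector Wiener filter of the slide).
f_hat = np.linalg.solve(sigma2 * K_inv + H.T @ H, H.T @ g)

print("RMSE of g    :", np.sqrt(np.mean((g - f_true) ** 2)))
print("RMSE of f_hat:", np.sqrt(np.mean((f_hat - f_true) ** 2)))
```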

34 Example: Gaussian observations and Gaussian prior -- special cases. No noise: in the absence of noise, σ² = 0,
  f̂ = arg min_f ||H f - g||² = (H^T H)^(-1) H^T g.
(H^T H)^(-1) H^T ≡ H† is called the Moore-Penrose pseudo (or generalized) inverse of matrix H. If H^(-1) exists, H† = H^(-1); if H is not invertible, H† provides the least-squares-sense pseudo-solution. This estimate is also the maximum likelihood one.

35 Example: Gaussian observations and Gaussian prior -- special cases. Prior covariance known up to a factor: K = φ² B, with the diagonal elements of B equal to 1; φ² can be seen as the "prior variance". We can then write
  f̂ = ( (σ²/φ²) B^(-1) + H^T H )^(-1) H^T g,
where σ²/φ² controls the relative weight of the prior and the data. Since K^(-1) = B^(-1)/φ² is positive definite, there exists a unique symmetric D such that D D = D^T D = B^(-1). This allows writing
  f̂ = arg min_f { ||g - H f||² + (σ²/φ²) ||D f||² }.
In regularization theory parlance, ||D f||² is called the regularizing term, and σ²/φ² the regularization parameter.

36 Summary of what we have seen up to this point.
- Image analysis problems are inference problems.
- Introduction to Bayesian inference:
  - Fundamental principles: knowledge as probability, likelihood and conditionality.
  - Fundamental tool: Bayes rule.
  - Necessary models: observation model, prior, loss function.
  - A posteriori expected loss and optimal Bayes rules.
  - The "0/1" loss function and MAP inference.
  - The quadratic error loss function and posterior mean estimation.
  - Example: Gaussian observations and Gaussian prior.
  - Example: mixture of Gaussians observations and Gaussian prior.
  - Improper priors and maximum likelihood (ML) inference.
  - Compound inference: additive and non-additive loss functions.
  - Example: Gaussian observations with Gaussian prior.

37 Conjugate priors: looking for computational convenience. Sometimes the prior knowledge is vague enough to allow tractability concerns to come into play. In other words: choose priors compatible with the available knowledge, but leading to a tractable a posteriori probability function. Conjugate priors formalize this goal. Given a family of likelihood functions L = { p(g|f), f ∈ F } and a (parametrized) family of priors P = { p(f|θ), θ ∈ Θ }, P is a conjugate family for L if, for every p(g|f) ∈ L and p(f|θ) ∈ P,
  p(f|g) = p(g|f) p(f|θ) / p(g) ∈ P,
i.e., there exists θ' ∈ Θ such that p(f|g) = p(f|θ').

38 Conjugate priors: a simple example. The family of Gaussian likelihood functions of common variance:
  L = { p(g|f) = N(f, σ²), f ∈ R }.
The family of Gaussian priors of arbitrary mean and variance:
  P = { p(f|μ, φ²) = N(μ, φ²), (μ, φ²) ∈ R × R⁺ }.
The a posteriori probability density function is again Gaussian,
  p(f|g) = N( (σ² μ + φ² g)/(σ² + φ²), σ² φ²/(σ² + φ²) ) ∈ P.
Very important: computing the a posteriori probability function only involves "updating" the parameters of the prior.

39 Conjugate priors: another example. θ is the (unknown) "heads" probability of a given coin. Outcomes of a sequence of n tosses: x = (x_1, ..., x_n), x_i ∈ {0, 1}. Likelihood function (Bernoulli), with n_h(x) = x_1 + x_2 + ... + x_n:
  p(x|θ) = θ^(n_h(x)) (1 - θ)^(n - n_h(x)).
A priori belief: "θ should be close to 1/2". Conjugate prior: the Beta density
  p(θ|α, β) = Be(α, β) = [ Γ(α + β) / (Γ(α) Γ(β)) ] θ^(α-1) (1 - θ)^(β-1),
defined for θ ∈ [0, 1] and α, β > 0.

40 Conjugate priors: Bernoulli example (cont.). Main features of Be(α, β):
  E[θ|α, β] = α / (α + β)   (mean)
  E[ (θ - E[θ])² ] = α β / ( (α + β)² (α + β + 1) )   (variance)
  arg max_θ p(θ|α, β) = (α - 1) / (α + β - 2)   (mode, if α, β > 1)
To "pull" the estimate towards 1/2, choose α = β; the common value α = β controls "how strongly we pull".

41 Conjugate priors: Bernoulli example (cont.). (Figure: several Beta densities, with α = β = 1, α = β = 2, α = β = 10, ..., as functions of θ.) For α = β = 1, the behavior is qualitatively different: the mode at 1/2 disappears.

42 Conjugate priors: Bernoulli example (cont.). The a posteriori distribution is again Beta,
  p(θ|x, α, β) = Be(α + n_h(x), β + n - n_h(x)).
Bayesian estimates of θ:
  θ̂_PM = θ̂_PM(x) = (α + n_h(x)) / (α + β + n)
  θ̂_MAP = θ̂_MAP(x) = (α + n_h(x) - 1) / (α + β + n - 2).
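A minimal sketch of this conjugate update, assuming NumPy; the prior hyperparameters and the simulated data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

theta_true = 0.7
n = 50
x = rng.binomial(1, theta_true, size=n)    # coin tosses
n_h = int(x.sum())                         # number of heads

alpha, beta = 5.0, 5.0                     # Be(5, 5): "theta close to 1/2"

# Conjugate update: the posterior is again Beta.
alpha_post = alpha + n_h
beta_post = beta + n - n_h

theta_pm = alpha_post / (alpha_post + beta_post)
theta_map = (alpha_post - 1) / (alpha_post + beta_post - 2)
theta_ml = n_h / n

print(f"ML : {theta_ml:.3f}")
print(f"PM : {theta_pm:.3f}")
print(f"MAP: {theta_map:.3f}")
```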

43 Conjugate priors: Bernoulli example (cont.). (Figure: evolution of the a posteriori densities with the number of tosses n, for a Be(5, 5) prior (dotted line) and a Be(1, 1) prior (solid line).)

44 Conjugate priors: variance of Gaussian observations. n i.i.d. zero-mean Gaussian observations of unknown variance σ² = 1/θ. Likelihood function:
  f(x|θ) = Π_{i=1}^n sqrt(θ/(2π)) exp{ -θ x_i²/2 } = (θ/(2π))^(n/2) exp{ -(θ/2) Σ_{i=1}^n x_i² }.
Conjugate prior: the Gamma density,
  p(θ|α, β) = Ga(α, β) = [ β^α / Γ(α) ] θ^(α-1) exp{ -β θ },
for θ ∈ [0, ∞) (recall θ = 1/σ²) and α, β > 0.

45 Conjugate priors: variance of Gaussian observations (cont.). Main features of the Gamma density:
  E[θ|α, β] = α/β   (mean)
  E[ (θ - E[θ])² ] = α/β²   (variance)
  arg max_θ p(θ|α, β) = (α - 1)/β   (mode, if α ≥ 1)

46 Conjugate priors: variance of Gaussian observations (cont.). (Figure: several Gamma densities, with α = β = 1, α = β = 2, α = β = 10, and α = 62.5, β = ..., as functions of θ.)

47 Conjugate priors: variance of Gaussian observations (cont.). A posteriori density:
  p(θ|x_1, x_2, ..., x_n) = Ga( α + n/2, β + (1/2) Σ_{i=1}^n x_i² ).
The corresponding Bayesian estimates:
  θ̂_PM = (α + n/2) / (β + (1/2) Σ_{i=1}^n x_i²)
  θ̂_MAP = (α + n/2 - 1) / (β + (1/2) Σ_{i=1}^n x_i²).
Both estimates converge to the ML estimate:
  lim_{n→∞} θ̂_PM = lim_{n→∞} θ̂_MAP = θ̂_ML = n ( Σ_{i=1}^n x_i² )^(-1).
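The same kind of sketch for the precision of zero-mean Gaussian data, with illustrative hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(3)

sigma2_true = 2.0                       # true variance; theta = 1/sigma2
n = 200
x = rng.normal(0.0, np.sqrt(sigma2_true), size=n)

alpha, beta = 2.0, 2.0                  # Ga(2, 2) prior on the precision theta

# Conjugate update: posterior is Ga(alpha + n/2, beta + sum(x^2)/2).
alpha_post = alpha + n / 2.0
beta_post = beta + 0.5 * np.sum(x ** 2)

theta_pm = alpha_post / beta_post
theta_map = (alpha_post - 1.0) / beta_post
theta_ml = n / np.sum(x ** 2)

print(f"true precision: {1.0 / sigma2_true:.3f}")
print(f"ML : {theta_ml:.3f}")
print(f"PM : {theta_pm:.3f}")
print(f"MAP: {theta_map:.3f}")
```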

48 The von Mises theorem. As long as the prior is continuous and not zero at the location of the ML estimate, the MAP estimate converges to the ML estimate as the number of data points n increases.

49 Bayesian model selection. Scenario: there are K models available, i.e., m ∈ {m_1, ..., m_K}. Given model m:
- likelihood function: p(g|f(m), m);
- prior: p(f(m)|m).
Under different m's, f(m) may have different meanings and sizes. A priori model probabilities: { p(m), m = m_1, ..., m_K }. The a posteriori probability function is
  p(m, f(m)|g) = p(g|f(m), m) p(f(m)|m) p(m) / p(g).

50 Seen strictly as a model selection problem, the natural loss function is the "0/1" loss with respect to the model, i.e.,
  L[ (m, f(m)), (m̂, f̂(m̂)) ] = 0, if m̂ = m;  1, if m̂ ≠ m.
The resulting rule is the "most probable model a posteriori":
  m̂ = arg max_m ∫ p(m, f(m)|g) df(m) = arg max_m p(m|g)
    = arg max_m p(m) ∫ p(g|f(m), m) p(f(m)|m) df(m)
    = arg max_m { p(m) p(g|m) },
where p(g|m) is the evidence. Main difficulty: improper priors (for p(f(m)|m)) are not valid here, because they are only defined up to a factor.

51 Bayesian model selection. Comparing two models: which of m_1 or m_2 is a posteriori more likely? The answer is given by the so-called "posterior odds ratio":
  p(m_1|g) / p(m_2|g) = [ p(g|m_1) / p(g|m_2) ] × [ p(m_1) / p(m_2) ] = "Bayes factor" × "prior odds ratio".
Bayes factor = evidence, provided by g, for m_1 versus m_2.

52 Bayesian model selection: example. Does a sequence of binary variables (e.g., coin tosses) come from two different sources? Observations: g = [g_1, ..., g_t, g_{t+1}, ..., g_{2t}], with g_i ∈ {0, 1}. Competing models:
- m_1 = "all g_i's come from the same i.i.d. binary source with Prob(1) = θ" (e.g., the same coin);
- m_2 = "[g_1, ..., g_t] and [g_{t+1}, ..., g_{2t}] come from two different sources with Prob(1) = θ and Prob(1) = φ, respectively" (e.g., two coins with different probabilities of "heads").
Parameter vector under m_1: f(m_1) = [θ]. Parameter vector under m_2: f(m_2) = [θ, φ]. Notice that with θ = φ, m_2 becomes m_1.

53 Bayesian model selection: example (cont.). Likelihood function under m_1:
  p(g|θ, m_1) = Π_{i=1}^{2t} θ^(g_i) (1 - θ)^(1 - g_i) = θ^(n(g)) (1 - θ)^(2t - n(g)),
where n(g) is the total number of 1's. Likelihood function under m_2:
  p(g|θ, φ, m_2) = θ^(n_1(g)) (1 - θ)^(t - n_1(g)) φ^(n_2(g)) (1 - φ)^(t - n_2(g)),
where n_1(g) and n_2(g) are the numbers of ones in the first and second halves of the data, respectively. Notice that n_1(g) + n_2(g) = n(g).

54 Bayesian model selection: example (cont.). Prior under m_1: p(θ|m_1) = 1, for θ ∈ [0, 1]. Prior under m_2: p(θ, φ|m_2) = 1, for (θ, φ) ∈ [0, 1] × [0, 1]. These two priors mean: "in either case, we know nothing about the parameters".

55 Bayesian model selection: example (cont.). Evidence in favor of m_1 (recall that p(θ|m_1) = 1):
  p(g|m_1) = ∫_0^1 θ^(n(g)) (1 - θ)^(2t - n(g)) dθ = n(g)! (2t - n(g))! / (2t + 1)!.
Evidence in favor of m_2 (recall that p(θ, φ|m_2) = 1):
  p(g|m_2) = ∫_0^1 ∫_0^1 θ^(n_1(g)) (1 - θ)^(t - n_1(g)) φ^(n_2(g)) (1 - φ)^(t - n_2(g)) dθ dφ
           = [ n_1(g)! (t - n_1(g))! / (t + 1)! ] [ n_2(g)! (t - n_2(g))! / (t + 1)! ].
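A short sketch that computes the two evidences in log-space (via log-factorials) and the resulting log Bayes factor; the simulated data and source probabilities are illustrative:

```python
import numpy as np
from math import lgamma   # lgamma(n + 1) = log(n!)

def log_evidence_m1(n, two_t):
    # log of n! (2t - n)! / (2t + 1)!
    return lgamma(n + 1) + lgamma(two_t - n + 1) - lgamma(two_t + 2)

def log_evidence_m2(n1, n2, t):
    # log of the product of the two per-half terms
    return (lgamma(n1 + 1) + lgamma(t - n1 + 1) - lgamma(t + 2)
            + lgamma(n2 + 1) + lgamma(t - n2 + 1) - lgamma(t + 2))

rng = np.random.default_rng(4)
t = 50
g1 = rng.binomial(1, 0.3, size=t)          # first half: Prob(1) = 0.3
g2 = rng.binomial(1, 0.7, size=t)          # second half: Prob(1) = 0.7
n1, n2 = int(g1.sum()), int(g2.sum())

log_bf = log_evidence_m1(n1 + n2, 2 * t) - log_evidence_m2(n1, n2, t)
print(f"n1 = {n1}, n2 = {n2}, log Bayes factor (m1 vs m2) = {log_bf:.2f}")
# Negative values favour m2 (two different sources), as expected for this data.
```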

56 Bayesian model selection: example (cont.). (Figure: decision regions over all possible outcomes (n_1 versus n), with 2t = 100 and p(m_1) = p(m_2) = 1/2; a central band is decided as m_1 (same source), the two outer regions as m_2 (two sources).)

57 Bayesian model selection: another example. Segmenting a sequence of binary i.i.d. observations. (Figure: the sequence of trials; the log of the Bayes factor versus candidate location, for the first segmentation, for the segmentation of the left segment, and for the segmentation of the right segment.) Is there a change of model? Where?

58 Model selection: Schwarz's Bayesian inference criterion (BIC). Often, it is very difficult/impossible to compute p(g|m). By using a Taylor expansion of the likelihood around the ML estimate, and for a "smooth enough" prior, we have
  p(g|m) ≈ p(g|f̂(m), m) n^(-dim(f(m))/2) ≡ BIC(m),
where f̂(m) is the ML estimate under model m, dim(f(m)) = "dimension of f(m) under model m", and n is the size of the observation vector g. Let us also look at
  -log(BIC(m)) = -log p(g|f̂(m), m) + (dim(f(m))/2) log n.

59 Model selection: Rissanen's minimum description length (MDL). Consider an unknown f(k) of unknown dimension k. Data is observed according to p(g|f(k)). For each k (each model), p(f(k)|k) is constant, i.e., if k were known, we could find the ML estimate f̂(k). However, k is unknown, and the likelihood increases with k:
  k_2 > k_1  ⇒  p(g|f̂(k_2)) ≥ p(g|f̂(k_1)).
Conclusion: the ML estimate of k is "as large as possible"; this is clearly useless.

60 Minimum description length (MDL). Fact (from information theory): the shortest code-length for data g, given that it was generated according to p(g|f(k)), is
  L(g|f(k)) = -log_2 p(g|f(k))   (bits).
Then, for a given k, looking for the ML estimate of f(k) is the same as looking for the code for which g has the shortest code-word:
  arg max_{f(k)} p(g|f(k)) = arg min_{f(k)} { -log p(g|f(k)) } = arg min_{f(k)} L(g|f(k)).
If a code is built to transmit g, based on f(k), then f(k) also has to be transmitted. Conclusion: the total code-length is
  L(g, f(k)) = L(g|f(k)) + L(f(k)).

61 Minimum description length (MDL) (cont.). The total code-length is L(g, f(k)) = -log_2 p(g|f(k)) + L(f(k)). The MDL criterion:
  (k̂, f̂(k̂))_MDL = arg min_{k, f(k)} { -log_2 p(g|f(k)) + L(f(k)) }.
Basically, the term L(f(k)) grows with k, counterbalancing the behavior of the likelihood. From a Bayesian point of view, we have a prior
  p(f(k)) ∝ 2^(-L(f(k))).

62 Minimum description length (cont.). What about L(f(k))? It is problem-dependent. If the components of f(k) are real numbers (and under certain other conditions), the (asymptotically) optimal choice is
  L(f(k)) = (k/2) log n,
where n is the size of the data vector g. Interestingly, in this case MDL coincides with BIC. In other situations (e.g., discrete parameters), there are natural choices.

63 Minimum description length: example. Fitting a polynomial of unknown degree: f(k+1) contains the coefficients of a k-order polynomial. Observation model: g = "true polynomial plus white Gaussian noise". (Figure: fits of several orders -- 2, 3, 4, 6, 10, 15, ... -- to the same data.)

64 Minimum description length: example. Fitting a polynomial of unknown degree. (Figure: the negative log-likelihood -log p(g|f(k)) keeps going down as the polynomial order grows, but the description length is minimized at, and MDL picks, the right order k̂ = 4.)
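A compact sketch of this order-selection experiment: least-squares polynomial fits of increasing order, scored with the (k/2) log n penalty from the previous slides. The generating polynomial, noise level, and order range are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)

n = 100
t = np.linspace(-1.0, 1.0, n)
true_coeffs = [3.0, -1.0, 0.3, 2.0, -1.5]          # a 4th-order polynomial
sigma = 0.3
g = np.polyval(true_coeffs, t) + rng.normal(0.0, sigma, size=n)

best_order, best_dl = None, np.inf
for k in range(0, 16):
    coeffs = np.polyfit(t, g, deg=k)               # ML (least-squares) fit
    resid = g - np.polyval(coeffs, t)
    sigma2_hat = np.mean(resid ** 2)
    # Gaussian negative log-likelihood at the ML fit (constants dropped).
    neg_log_lik = 0.5 * n * np.log(sigma2_hat)
    # MDL / BIC penalty: (number of coefficients / 2) * log n.
    descr_len = neg_log_lik + 0.5 * (k + 1) * np.log(n)
    if descr_len < best_dl:
        best_order, best_dl = k, descr_len

print("selected order:", best_order)               # usually 4 for this setup
```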

65 Introduction to Markov random fields. Image analysis problems are compound inference problems. The prior p(f) formalizes the expected joint behavior of the elements of f. Markov random fields are a convenient tool to write priors for image analysis problems, just as Markov random processes formalize temporal evolutions/dependencies.

66 Graphs and random fields on graphs: basic graph-theoretic concepts. A graph G = (N, E) is a collection of nodes (or vertices) N = {n_1, n_2, ..., n_|N|} and edges E = {(n_{i_1}, n_{i_2}), ...} ⊆ N × N. Notation: |N| = number of elements of set N. We consider only undirected graphs, i.e., the elements of E are seen as unordered pairs: (n_i, n_j) ≡ (n_j, n_i). Two nodes n_1, n_2 ∈ N are neighbors if the corresponding edge exists, i.e., if (n_1, n_2) ∈ E.

67 Graphs and random fields on graphs: basic graph-theoretic concepts (cont.). A complete graph: all nodes are neighbors of all other nodes. A node is not a neighbor of itself: no (n_i, n_i) edges are allowed. Neighborhood of a node: N(n_i) = { n_j : (n_i, n_j) ∈ E }. The neighborhood relation is symmetrical: n_j ∈ N(n_i) ⇔ n_i ∈ N(n_j).

68 Graphs and random fields on graphs. Example of a graph:
  N = {1, 2, 3, 4, 5, 6},
  E = { (1,2), (1,3), (2,4), (2,5), (3,6), (5,6), (3,4), (4,5) } ⊆ N × N,
  N(1) = {2, 3}, N(2) = {1, 4, 5}, N(3) = {1, 4, 6}, etc.

69 Graphs and random fields on graphs. A clique of G is either a single node or a complete subgraph of G; in other words, a single node or a subset of nodes that are all mutual neighbors. (Figure: examples of cliques from the previous graph.) Set of all cliques (from the same example): C = N ∪ E ∪ { (2, 4, 5) }.

70 Graphs and random fields on graphs. A length-k path in G is an ordered sequence of nodes (n_1, n_2, ..., n_k) such that (n_j, n_{j+1}) ∈ E. (Figure: a graph and a path in it.)

71 Graphs and random fields on graphs. Let A, B, C be three disjoint subsets of N. We say that C separates A from B if any path from a node in A to a node in B contains one (or more) node in C. Example: in the graph above, C = {1, 4, 6} separates A = {3} from B = {2, 5}.
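These notions are easy to check programmatically. A small sketch for the example graph above, using a breadth-first search that is not allowed to enter C in order to test separation (pure standard library; the function name is just illustrative):

```python
from collections import deque

nodes = {1, 2, 3, 4, 5, 6}
edges = {(1, 2), (1, 3), (2, 4), (2, 5), (3, 6), (5, 6), (3, 4), (4, 5)}

# Undirected neighborhoods N(i).
nbrs = {i: set() for i in nodes}
for a, b in edges:
    nbrs[a].add(b)
    nbrs[b].add(a)

def separates(C, A, B):
    """True if every path from A to B passes through C (BFS that avoids C)."""
    frontier, seen = deque(A), set(A)
    while frontier:
        u = frontier.popleft()
        if u in B:
            return False               # reached B without crossing C
        for v in nbrs[u] - C - seen:
            seen.add(v)
            frontier.append(v)
    return True

print("N(2) =", nbrs[2])                               # {1, 4, 5}
print("C={1,4,6} separates {3} from {2,5}:",
      separates({1, 4, 6}, {3}, {2, 5}))               # True
```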

72 Graphs and random fields on graphs. Consider a joint probability function p(f) = p(f_1, f_2, ..., f_m). Assign each variable to a node of a graph, N = {1, 2, ..., m}; we then have a "random field on graph N". Let f_A, f_B, f_C be three disjoint subsets of f (i.e., A, B, and C are disjoint subsets of N). If
  p(f_A, f_B | f_C) = p(f_A|f_C) p(f_B|f_C)  ⇐  "C separates A from B",
then p(f) is "global Markov" with respect to N, and the graph is called an "I-map" of p(f). Any p(f) is "global Markov" with respect to the complete graph. If rather than ⇐ we have ⇔, the graph is called a "perfect I-map".

73 Graphs and random fields on graphs: pair-wise Markovianity. (i, j) ∉ E ⇒ "f_i and f_j are independent, when conditioned on all the other variables". Proof: simply notice that if i and j are not neighbors, the remaining nodes separate i from j. Example: in the accompanying six-node graph,
  p(f_1, f_6 | f_2, f_3, f_4, f_5) = p(f_1 | f_2, f_3, f_4, f_5) p(f_6 | f_2, f_3, f_4, f_5).

74 Local Markovianity: "given its neighborhood, a variable is independent of the rest",
  p(f_i, f_{N\({i} ∪ N(i))} | f_{N(i)}) = p(f_i | f_{N(i)}) p(f_{N\({i} ∪ N(i))} | f_{N(i)}).
Proof: notice that N(i) separates f_i from the rest of the graph. Equivalent form (better known in the MRF literature):
  p(f_i | f_{N\{i}}) = p(f_i | f_{N(i)}).
Proof: divide the equality above by p(f_{N\({i} ∪ N(i))} | f_{N(i)}):
  p(f_i | f_{N\({i} ∪ N(i))}, f_{N(i)}) = p(f_i | f_{N(i)})  ⇔  p(f_i | f_{N\{i}}) = p(f_i | f_{N(i)}),
because [N\({i} ∪ N(i))] ∪ N(i) = N\{i}.

75 The Hammersley-Clifford theorem. Consider a random field F on a graph N, such that p(f) > 0.
a) If the field F has the local Markov property, then p(f) can be written as a Gibbs distribution,
  p(f) = (1/Z) exp{ -Σ_{C ∈ C} V_C(f_C) },
where Z, the normalizing constant, is called the partition function. The functions V_C(·) are called clique potentials. The negative of the exponent is called the energy.
b) If p(f) can be written in Gibbs form for the cliques of some graph, then it has the global Markov property.
Fundamental consequence: a Markov random field can be specified via the clique potentials.

76 Hammersley-Clifford theorem (cont.). Computing the local Markovian conditionals from the clique potentials:
  p(f_i | f_{N(i)}) = (1/Z(f_{N(i)})) exp{ -Σ_{C: i ∈ C} V_C(f_C) }.
Notice that the normalizing constant may depend on the neighborhood state.

77 Regular rectangular lattices. Let us now focus on regular rectangular lattices:
  N = { (i, j), i = 1, ..., M, j = 1, ..., N }.
A hierarchy of neighborhood systems:
  N⁰(i, j) = { } -- zero-order (empty neighborhoods);
  N¹(i, j) = { (k, l) : (i - k)² + (j - l)² ≤ 1 } -- order-1 (4 nearest neighbors);
  N²(i, j) = { (k, l) : (i - k)² + (j - l)² ≤ 2 } -- order-2 (8 nearest neighbors);
  etc.

78 Regular rectangular lattices. Illustration of the first order neighborhood system:
  N¹(i, j) = { (i-1, j), (i, j-1), (i, j+1), (i+1, j) }   (4 nearest neighbors).

79 Regular rectangular lattices. Illustration of the second order neighborhood system:
  N²(i, j) = { (i-1, j-1), (i-1, j), (i-1, j+1), (i, j-1), (i, j+1), (i+1, j-1), (i+1, j), (i+1, j+1) }   (8 nearest neighbors).

80 Regular rectangular lattices. Cliques of a first order neighborhood system: all single nodes, plus all vertical pairs {(i-1, j), (i, j)} and horizontal pairs {(i, j-1), (i, j)}. Notation: C_k = "set of all cliques for the order-k neighborhood system".

81 Regular rectangular lattices. Cliques of a second order neighborhood: C_1 plus all subgraphs of the following types: diagonal pairs, the L-shaped triples, and the 2×2 quadruple of mutually adjacent sites.

82 Auto-models. Only pair-wise interactions; in terms of clique potentials, |C| > 2 ⇒ V_C(·) = 0. These are the simplest models beyond site independence. Even for large neighborhoods, we can define an auto-model.

83 Gauss-Markov random fields (GMRF). Joint probability density function (for zero mean):
  p(f) = [ det(A)^(1/2) / (2π)^(m/2) ] exp{ -(1/2) f^T A f }.
Matrix A (the potential matrix, inverse of the covariance matrix) determines the neighborhood system:
  i ∈ N(j)  ⇔  A_ij ≠ 0   (i ≠ j).
The quadratic form in the exponent can be written as
  f^T A f = Σ_{i=1}^m Σ_{j=1}^m f_i f_j A_ij,
revealing that this is an auto-model (there are only pair-wise terms).

84 Notice that, to be a valid potential matrix, A has to be symmetric, thus respecting the symmetry of the neighborhood relations.

85 Gauss-Markov random fields (GMRF). The local (Markov-type) conditionals are univariate Gaussian:
  p(f_i | {f_j, j ≠ i}) = sqrt(A_ii/(2π)) exp{ -(A_ii/2) [ f_i + (1/A_ii) Σ_{j≠i} A_ij f_j ]² }
                        = N( -(1/A_ii) Σ_{j≠i} A_ij f_j, 1/A_ii ).

86 Gauss-Markov random fields (GMRF). Specification via clique potentials: squares of weighted sums,
  V_C(f_C) = (α/2) ( Σ_{j ∈ C} θ_j^C f_j )² = (α/2) ( Σ_{j ∈ N} θ_j^C f_j )²,
as long as we define θ_j^C = 0 for j ∉ C. The exponent of the GMRF density becomes
  -Σ_{C ∈ C} V_C(f_C) = -(α/2) Σ_{j ∈ N} Σ_{k ∈ N} f_j f_k ( Σ_{C ∈ C} θ_j^C θ_k^C ) = -(1/2) f^T (α A) f,   with A_jk = Σ_{C ∈ C} θ_j^C θ_k^C,
showing that this is a GMRF with potential matrix α A.

87 Gauss-Markov random fields (GMRF): the classical "smoothing prior" GMRF. A lattice N = { (i, j), i = 1, ..., M, j = 1, ..., N }, with a first order neighborhood
  N((i, j)) = { (i-1, j), (i, j-1), (i+1, j), (i, j+1) }.
Clique set: all pairs of (vertically or horizontally) adjacent sites. Clique potentials: squares of first-order differences,
  V_{(i,j),(i,j-1)}(f_ij, f_i,j-1) = (α/2) (f_ij - f_i,j-1)²
  V_{(i,j),(i-1,j)}(f_ij, f_i-1,j) = (α/2) (f_ij - f_i-1,j)².
Resulting A matrix: block-tridiagonal with tridiagonal blocks. Matrix A is also quasi-block-Toeplitz with quasi-Toeplitz blocks; "quasi-" due to boundary corrections.
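A sketch that assembles this potential matrix for a small M × N lattice directly from the pair cliques; the (α/2) factor is kept outside, and the column-by-column stacking of the image into a vector is an assumption of this sketch:

```python
import numpy as np

def smoothing_potential_matrix(M, N):
    """A such that f^T A f = sum over adjacent pairs of (f_p - f_q)^2."""
    m = M * N
    A = np.zeros((m, m))

    def idx(i, j):                 # stack the image column by column
        return j * M + i

    for i in range(M):
        for j in range(N):
            p = idx(i, j)
            for di, dj in ((1, 0), (0, 1)):        # down and right neighbours
                if i + di < M and j + dj < N:
                    q = idx(i + di, j + dj)
                    # (f_p - f_q)^2 contributes +1, +1 on the diagonal, -1 off it.
                    A[p, p] += 1.0
                    A[q, q] += 1.0
                    A[p, q] -= 1.0
                    A[q, p] -= 1.0
    return A

A = smoothing_potential_matrix(4, 3)
print(A.shape)                     # (12, 12): block-tridiagonal, tridiagonal blocks
print(np.allclose(A, A.T))         # symmetric, as required
```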

88 Bayesian image restoration with GMRF prior. A "smoothing" GMRF prior:
  p(f) ∝ exp{ -(α/2) f^T A f },
where A is as defined on the previous slide. Observation model: linear operator (matrix) plus additive white Gaussian noise,
  g = H f + n,  where n ~ N(0, σ² I).
This models well: out-of-focus blur, motion blur, tomographic imaging, ... There is nothing new here: we saw before that the MAP and PM estimates are simply
  f̂ = (σ² α A + H^T H)^(-1) H^T g.
...the only difficulty: the matrix to be inverted is huge.

89 Bayesian image restoration with GMRF prior (cont.). With a "smoothing" GMRF prior and a linear observation model plus Gaussian noise, the optimal estimate is
  f̂ = (σ² α A + H^T H)^(-1) H^T g.
A similar result can be obtained in other theoretical frameworks: regularization, penalized likelihood. Notice that
  lim_{σ²→0} (σ² α A + H^T H)^(-1) H^T = (H^T H)^(-1) H^T = H†,
the (least squares) pseudo-inverse of H. The huge size of (σ² α A + H^T H) precludes any explicit inversion; iterative schemes are (almost always) used.
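A sketch of such an iterative scheme on a small 1-D problem: a hand-written conjugate-gradient loop that only needs matrix-vector products with (σ²αA + HᵀH), never its inverse. The blur, signal, and parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)

# Small 1-D restoration problem: g = H f + n, smoothing prior energy (alpha/2) f^T A f.
m = 80
f_true = np.repeat(rng.normal(size=8), 10)        # piecewise-constant signal

H = np.zeros((m, m))                              # illustrative 5-tap blur
for i in range(m):
    for k in range(-2, 3):
        if 0 <= i + k < m:
            H[i, i + k] = 0.2

D = np.eye(m) - np.eye(m, k=1)                    # first-order differences
A = D.T @ D                                       # 1-D "smoothing" potential matrix

sigma2, alpha = 0.01, 1.0
g = H @ f_true + rng.normal(0.0, np.sqrt(sigma2), size=m)

def matvec(x):                                    # (sigma^2 alpha A + H^T H) x
    return sigma2 * alpha * (A @ x) + H.T @ (H @ x)

# Plain conjugate gradients: solve matvec(f) = H^T g without forming an inverse.
b = H.T @ g
f = np.zeros(m)
r = b - matvec(f)
p = r.copy()
rs = r @ r
for _ in range(200):
    Ap = matvec(p)
    a = rs / (p @ Ap)
    f += a * p
    r -= a * Ap
    rs_new = r @ r
    if np.sqrt(rs_new) < 1e-8:
        break
    p = r + (rs_new / rs) * p
    rs = rs_new

print("RMSE of restored signal:", np.sqrt(np.mean((f - f_true) ** 2)))
```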

90 Bayesian image restoration with GMRF prior (cont.). Examples (figure): (a) original; (b) blurred and slightly noisy; (c) restored from (b); (d) no blur, severe noise; (e) restored from (d). Deblurring: a good job. Denoising: oversmoothing, i.e., "discontinuities" are smoothed out.

91 Solutions to the oversmoothing nature of the GMRF prior.
- Explicitly detect and preserve discontinuities: compound GMRF models, weak membrane, etc. A new set of variables comes into play: the edge (or line) field.
- Replace the "square law" potentials by other, more "robust" functions. Consequence: the quadratic nature of the a posteriori energy is lost; the optimization becomes much more difficult.

92 Compound Gauss-Markov random fields. Insert binary variables which can "turn off" clique potentials: horizontal and vertical line variables h_ij ∈ {0, 1} and v_ij ∈ {0, 1}, placed between adjacent pixel pairs. New clique potentials:
  V(f_ij, f_i,j-1, v_ij) = (α/2) (1 - v_ij) (f_ij - f_i,j-1)²
  V(f_ij, f_i-1,j, h_ij) = (α/2) (1 - h_ij) (f_ij - f_i-1,j)².

93 Compound Gauss-Markov random fields (cont.). The line variables can leave the quadratic potentials on,
  V(f_ij, f_i,j-1, 0) = (α/2) (f_ij - f_i,j-1)²,   V(f_ij, f_i-1,j, 0) = (α/2) (f_ij - f_i-1,j)²,
or "turn them off",
  V(f_ij, f_i,j-1, 1) = 0,   V(f_ij, f_i-1,j, 1) = 0,
meaning "there is an edge here, do not smooth!".

94 Compound Gauss-Markov random fields (cont.). Given a certain configuration of the line variables, we still have a Gauss-Markov random field,
  p(f|h, v) ∝ exp{ -(α/2) f^T A(h, v) f },
but the potential matrix now depends on h and v. Given h and v, the MAP (and PM) estimate of f has the same form:
  f̂(h, v) = (σ² α A(h, v) + H^T H)^(-1) H^T g.
Question: how to estimate h and v? Hint: h and v are "parameters" of the prior. This motivates a detour on "how to estimate parameters?".

95 Parameter estimation in Bayesian inference problems. The prior depends on parameter(s) θ, i.e., we write p(f|θ). The likelihood (observation model) depends on parameter(s) φ, i.e., we write p(g|f, φ). With explicit reference to these parameters, Bayes rule becomes
  p(f|g, φ, θ) = p(g|f, φ) p(f|θ) / p(g|φ, θ) = p(g|f, φ) p(f|θ) / ∫ p(g|f, φ) p(f|θ) df.
Question: how can we estimate φ and θ from g, without violating the fundamental "likelihood principle"?

96 Parameter estimation in Bayesian inference problems. How to estimate θ and φ from g, without violating the "likelihood principle"? Answer: the scenario has to be modified.
- Rather than just f, there is a new set of unknowns: (f, θ, φ).
- There is a new likelihood function: p(g|f, θ, φ) = p(g|f, φ).
- A new prior is needed: p(f, θ, φ) = p(f|θ) p(θ, φ), because f is independent of φ (given θ).
Usually, p(θ, φ) is called a hyper-prior. This is called a hierarchical Bayesian setting; here, with two levels. Usually, θ and φ are a priori independent, p(θ, φ) = p(θ) p(φ). To add one more level, consider parameters λ of the hyper-prior, p(θ, λ) = p(θ|λ) p(λ). And so on...

97 Parameter estimation in Bayesian inference problems. We may compute a complete a posteriori probability function:
  p(f, θ, φ|g) = p(g|f, θ, φ) p(f, θ, φ) / p(g) = p(g|f, φ) p(f|θ) p(θ, φ) / ∫∫∫ p(g|f, φ) p(f|θ) p(θ, φ) df dθ dφ.
How to use it depends on the adopted loss function. Notice that, even if f, θ, and φ are scalar, this is now a compound inference problem.

98 Parameter estimation in Bayesian inference problems. Non-additive "0/1" loss function L[(f, θ, φ), (f̂, θ̂, φ̂)]. As seen above, this leads to the joint MAP (JMAP) criterion:
  (f̂, θ̂, φ̂)_JMAP = arg max_{(f, θ, φ)} p(f, θ, φ|g).
With a uniform prior on the parameters, p(θ, φ) = k,
  (f̂, θ̂, φ̂)_JMAP = arg max_{(f, θ, φ)} p(g|f, φ) p(f|θ) = arg max_{(f, θ, φ)} p(g, f|θ, φ) ≡ (f̂, θ̂, φ̂)_GML,
sometimes called the generalized maximum likelihood (GML) criterion.

99 Parameter estimation in Bayesian inference problems. A "0/1" loss function, additive with respect to f and the parameters, i.e.,
  L[(f, θ, φ), (f̂, θ̂, φ̂)] = L^(1)[f, f̂] + L^(2)[(θ, φ), (θ̂, φ̂)],
where L^(1)[·,·] is a non-additive "0/1" loss function and L^(2)[·,·] is an arbitrary loss function. From the results above on additive loss functions, the estimate of f is
  f̂_MMAP = arg max_f p(f|g) = arg max_f ∫∫ p(f, θ, φ|g) dθ dφ,
the so-called marginalized MAP (MMAP). The parameters are "integrated out" from the a posteriori density.

100 Parameter estimation in Bayesian inference problems. As in the previous case, let
  L[(f, θ, φ), (f̂, θ̂, φ̂)] = L^(1)[f, f̂] + L^(2)[(θ, φ), (θ̂, φ̂)],
now with L^(2)[·,·] a non-additive "0/1" loss function. Considering a uniform prior p(θ, φ) = k,
  (θ̂, φ̂)_MMAP = arg max_{(θ, φ)} ∫ p(f, θ, φ|g) df = arg max_{(θ, φ)} ∫ p(g|f, φ) p(f|θ) df = arg max_{(θ, φ)} ∫ p(g, f|θ, φ) df = arg max_{(θ, φ)} p(g|θ, φ),
the so-called marginal maximum likelihood (MML) estimate. The unknown image is "integrated out" from the likelihood function.

101 Parameter estimation in Bayesian inference problems. Implementing the JMAP criterion, (f̂, θ̂, φ̂)_JMAP = arg max p(f, θ, φ|g), is usually very difficult. A sub-optimal criterion, called partial optimal solution (POS): (f̂, θ̂, φ̂)_POS is a solution of the system
  f̂_POS = arg max_f p(f, θ̂_POS, φ̂_POS | g)
  θ̂_POS = arg max_θ p(f̂_POS, θ, φ̂_POS | g)
  φ̂_POS = arg max_φ p(f̂_POS, θ̂_POS, φ | g).
POS is weaker than JMAP, i.e., JMAP ⇒ POS, but POS ⇏ JMAP. How to find a POS? Simply cycle through its defining equations until a stationary point is reached.

102 Parameter estimation in Bayesian inference problems. Implementing the marginal ML criterion: the EM algorithm. Recall that the MML criterion is
  (θ̂, φ̂)_MML = arg max_{(θ, φ)} p(g|θ, φ) = arg max_{(θ, φ)} ∫ p(g, f|θ, φ) df.
Usually, it is infeasible to obtain the marginal likelihood analytically. Alternative: use the expectation-maximization (EM) algorithm.

103 Parameter estimation in Bayesian inference problems. The EM algorithm:
E-step: compute the so-called Q-function, the expected value of the logarithm of the complete likelihood function, given the current parameter estimates:
  Q(θ, φ | θ̂^(n), φ̂^(n)) = ∫ p(f|g, θ̂^(n), φ̂^(n)) log p(g, f|θ, φ) df.
M-step: update the parameter estimates according to
  (θ̂, φ̂)^(n+1) = arg max_{(θ, φ)} Q(θ, φ | θ̂^(n), φ̂^(n)).
Under certain (mild) conditions, lim_{n→∞} (θ̂, φ̂)^(n) = (θ̂, φ̂)_MML.
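The EM recipe above is generic; as a deliberately simple concrete sketch (not the image-restoration EM itself), here is EM for a two-component Gaussian mixture with unknown weight and offset and known variance, which mirrors the mixture observation model of slides 22-23. All numerical settings are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

# Data: each sample is N(mu, s2) with prob. lam, or N(0, s2) with prob. 1 - lam.
lam_true, mu_true, s2 = 0.6, 4.0, 1.0
n = 1000
z = rng.random(n) < lam_true
x = rng.normal(np.where(z, mu_true, 0.0), np.sqrt(s2))

def gauss(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

lam, mu = 0.5, 1.0                       # initial parameter guesses
for it in range(100):
    # E-step: responsibilities = posterior probability that each sample came
    # from the shifted component, given the current (lam, mu).
    num = lam * gauss(x, mu, s2)
    r = num / (num + (1 - lam) * gauss(x, 0.0, s2))
    # M-step: maximize the expected complete log-likelihood.
    lam = r.mean()
    mu = (r * x).sum() / r.sum()

print(f"lam = {lam:.3f} (true {lam_true}), mu = {mu:.3f} (true {mu_true})")
```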

104 Back to the image restoration problem. We have a prior p(f|h, v, α) and an observation model p(g|f, σ²) = N(f, σ² I). We have unknown parameters α, σ², h, and v, so our complete set of unknowns is (f, α, σ², h, v). We need a hyper-prior p(α, σ², h, v). It makes sense to assume independence, p(α, σ², h, v) = p(α) p(σ²) p(h) p(v). We also choose p(α) = k_1 and p(σ²) = k_2, i.e., we will look for ML-type estimates of these parameters.

105 Reparametrization of the edge variables. A natural parametrization of the edge variables uses the locations of those that are equal to 1, which are usually a small minority. Let γ_h(k_h) and γ_v(k_v) be defined according to
  h_ij = 1 ⇔ (i, j) ∈ γ_h(k_h),   v_ij = 1 ⇔ (i, j) ∈ γ_v(k_v):
γ_h(k_h) contains the locations of the k_h variables h_ij that are set to 1, and similarly for γ_v(k_v) with respect to the v_ij's. Example: if h_2,5 = 1, h_6,2 = 1, v_3,4 = 1, v_5,7 = 1, and v_9,12 = 1, then k_h = 2, k_v = 3,
  γ_h(2) = [(2, 5), (6, 2)],   γ_v(3) = [(3, 4), (5, 7), (9, 12)].

106 Reparametrization of the edge variables. We have two unknown parameter vectors, γ_h(k_h) and γ_v(k_v), and these parameter vectors have unknown dimension: k_h = ?, k_v = ? We have a "model selection problem". This justifies another detour: model selection.

107 Returning to our image restoration problem. We have two parameter vectors, γ_h(k_h) and γ_v(k_v), of unknown dimension. The natural description lengths are
  L(γ_h(k_h)) = k_h (log M + log N),   L(γ_v(k_v)) = k_v (log M + log N),
where we are assuming the image size is M × N. With this MDL "prior" we can now estimate α, σ², γ_h(k_h), γ_v(k_v), and, most importantly, f.

108 Example: discontinuity-preserving restoration, using the MDL prior for the parameters and the POS criterion. (Figure: (a) noisy image; (b) discontinuity-preserving restoration; (c) signaled discontinuities; (d) restoration without preserving discontinuities.)

110 Generalized Gaussians (Bouman and Sauer [16]), '(x) = jxj p, with 2 [1 2] (for p = 2 ) GMRF). p Discontinuity-preserving restoration: convex potentials Stevenson et al. [80] proposed 8 < '(x) = x2 ( jxj < 2jxj ; 2 ( jxj < : The function proposed by Green [40], '(x) = 2 2 log cosh(x=). Approximately quadratic for small x linear, for large x. Parameter controls the transition between the two behaviors. Page 110

111 Discontinuity-preserving restoration: convex potentials Page 111

112 The one suggested by Hebert and Leahy [45] is = log ; 1 + (x=) 2. '(x) Discontinuity-preserving restoration: non-convex potentials Radically dierent from the quadratic: they atten, for large arguments. Blake and Zisserman's '(x) = (minfjxj g) 2 [15], [30] The one proposed by Geman and McClure [35]: '(x) = x 2 =(x ) Geman and Reynolds [36] proposed: '(x) = jxj=(jxj + ). Page 112
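For reference, a sketch collecting the potentials listed above as plain functions; δ denotes the scale/threshold parameter in each case, and the parameterizations follow the forms quoted here rather than the cited papers' original notation:

```python
import numpy as np

def quadratic(x):                # the GMRF potential (oversmooths edges)
    return x ** 2

def huber(x, d=1.0):             # Stevenson et al.: quadratic near 0, linear beyond d
    return np.where(np.abs(x) <= d, x ** 2, 2 * d * np.abs(x) - d ** 2)

def gen_gaussian(x, p=1.2):      # Bouman and Sauer: |x|^p, 1 <= p <= 2
    return np.abs(x) ** p

def log_cosh(x, d=1.0):          # Green: ~quadratic for small x, linear for large x
    return 2 * d ** 2 * np.log(np.cosh(x / d))

def blake_zisserman(x, d=1.0):   # non-convex: truncated quadratic
    return np.minimum(np.abs(x), d) ** 2

def geman_mcclure(x, d=1.0):     # non-convex: saturates for large |x|
    return x ** 2 / (d + x ** 2)

def geman_reynolds(x, d=1.0):    # non-convex: saturates for large |x|
    return np.abs(x) / (np.abs(x) + d)

def hebert_leahy(x, d=1.0):      # non-convex: logarithmic growth
    return np.log(1 + (x / d) ** 2)

x = np.linspace(-5, 5, 11)
print(np.round(huber(x), 2))
print(np.round(geman_mcclure(x), 2))
```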

113 Discontinuity-preserving restoration: non-convex potentials. (Figure: plots of the non-convex potential functions listed above.)

114 Optimization problems. By far the most common criterion in MRF applications is the MAP. This requires locating the mode(s) of the posterior:
  f̂_MAP = arg max_f p(f|g) = arg max_f (1/Z_p(g)) exp{ -U_p(f|g) } = arg min_f U_p(f|g),
where U_p(f|g) is called the a posteriori energy. Except in very particular cases (GMRF prior and Gaussian noise), there is no analytical solution. Finding a MAP estimate is then a difficult task.

115 Optimization problems: simulated annealing. Notice that U_p(f|g) can be multiplied by any positive constant 1/T without changing the maximizer:
  arg max_f p(f|g) = arg max_f (1/Z_p(T)) exp{ -U_p(f|g)/T } = arg max_f p(f|g, T).
By analogy with the Boltzmann distribution, T is called the temperature.
- As T → ∞, p(f|g, T) becomes flat: all configurations equiprobable.
- As T → 0, the probability concentrates on the set of maximizing configurations (denoted Ω_0). Formally,
  lim_{T→0} p(f|g, T) = 1/|Ω_0|, if f ∈ Ω_0;  0, if f ∉ Ω_0.

116 Optimization problems: simulated annealing. Simulated annealing (SA) exploits this behavior of p(f|g, T):
- simulate a system whose equilibrium distribution is p(f|g, T);
- "cool" it until the temperature reaches zero.
Implementation issues of SA:
- Question: how to simulate a system with equilibrium distribution p(f|g, T)? Answer: the Metropolis algorithm or the Gibbs sampler.
- Question: how to "cool it down" without destroying the equilibrium? Answer: later.

117 The Metropolis algorithm: simulating a system with equilibrium distribution p(f, T) ∝ exp{ -U(f)/T }.
- Starting state f(0).
- Given the current state f(t), a random "candidate" c is generated; let G_{f(t),c} be the probability of the candidate configuration c, given the current f(t).
- The candidate c is accepted with probability A_{f(t),c}(T); A(T) = [A_{f,c}(T)] is the acceptance matrix.
The new state f(t+1) only depends on f(t): this is a Markov chain.

118 The Metropolis algorithm (cont.). The usual choice for the acceptance probability is
  A_{f(t),c}(T) = min{ 1, exp[ (U(f(t)) - U(c)) / T ] }.
Under certain conditions on the candidate-generation probabilities G and on the acceptance matrix A(T) (invariance of p(·, T) under the resulting transition probabilities), the equilibrium distribution of the chain is
  p(f, T) ∝ exp{ -U(f)/T }.

119 The Gibbs sampler. Replaces the generation/acceptance mechanism by a simpler one, exploiting the Markovianity of p(f).
- Current state f(t); choose a site (i.e., an element of f(t)), say f_i(t).
- Generate f(t+1) by replacing f_i(t) by a random sample of its conditional probability, with respect to p(f, T). All other elements are unchanged.
If every site is visited infinitely often, the equilibrium distribution is again p(f, T) ∝ exp{ -U(f)/T }.

120 Simulated annealing: cooling. The temperature evolves according to T(t), called the "cooling schedule". The cooling schedule must verify
  Σ_{t=1}^∞ exp{ -K/T(t) } = ∞,
where K is a problem-dependent constant. Best known case: T(t) = C / log(t + 1), with C ≥ K.

121 Iterated conditional modes (ICM) algorithm. It is a Gibbs sampler at zero temperature: the visited site is replaced by the maximizer of its conditional, given the current state of its neighbors. Advantage: extremely fast. Disadvantage: convergence to a local maximum. Sometimes, this may not really be a disadvantage.

122 Implementing the PM and MPM criteria. Recall the maximizer of posterior marginals (MPM),
  f̂_MPM = [ arg max_{f_1} p(f_1|g), ..., arg max_{f_m} p(f_m|g) ]^T,
and the posterior mean (PM), f̂_PM = E[f|g]. Simply simulate (i.e., sample from) p(f|g) using the Gibbs sampler or the Metropolis algorithm, and collect statistics:
- for the PM, site-wise averages approximate the PM estimate;
- for the MPM, collect site-wise histograms; these histograms are estimates of the marginals p(f_i|g), from which the MPM is easily obtained.
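A sketch of this recipe for a small Gaussian denoising posterior (smoothing GMRF prior, identity observation operator): a Gibbs sampler repeatedly resamples each site from its Gaussian local conditional, and site-wise averages of the post-burn-in samples approximate the PM estimate. All settings are illustrative:

```python
import numpy as np

rng = np.random.default_rng(8)

M, N = 16, 16
f_true = np.zeros((M, N))
f_true[:, N // 2:] = 2.0                       # a simple two-level image

sigma2, alpha = 0.25, 2.0                      # noise variance, prior weight
g = f_true + rng.normal(0.0, np.sqrt(sigma2), size=(M, N))

# Posterior energy: ||f - g||^2/(2 sigma2) + (alpha/2) * sum over adjacent pairs (f_p - f_q)^2.
f = g.copy()
n_burn, n_samples = 100, 400
accum = np.zeros_like(f)

for sweep in range(n_burn + n_samples):
    for i in range(M):
        for j in range(N):
            nb = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
            nb = [(a, b) for a, b in nb if 0 <= a < M and 0 <= b < N]
            s = sum(f[a, b] for a, b in nb)
            # Local conditional is Gaussian (GMRF): precision and mean below.
            prec = 1.0 / sigma2 + alpha * len(nb)
            mean = (g[i, j] / sigma2 + alpha * s) / prec
            f[i, j] = rng.normal(mean, np.sqrt(1.0 / prec))
    if sweep >= n_burn:
        accum += f

f_pm = accum / n_samples                       # site-wise averages ~ PM estimate
print("noisy RMSE:", np.sqrt(np.mean((g - f_true) ** 2)))
print("PM    RMSE:", np.sqrt(np.mean((f_pm - f_true) ** 2)))
```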

123 The partition function problem. MRFs are plagued by the difficulty of computing the partition function. This is especially true for parameter estimation. Few exceptions: GMRFs and Ising fields. This issue is dealt with by applying approximation techniques.

124 Approximating the partition function: pseudo-likelihood. Besag's pseudo-likelihood approximation:
  p(f) ≈ Π_{i ∈ N} p(f_i | f_N(i)).
This approximation was used in the example shown above on discontinuity-preserving restoration with CGMRFs and MDL priors.

125 Approximating the partition function: mean field. Imported from statistical physics. The exact function is approximated by a factored version:
  p(f) = exp{ -U(f) } / Z ≈ Π_{i ∈ N} exp{ -U_i^MF(f_i) } / Z_i^MF.
The quantity U_i^MF(f_i) is the mean field local energy,
  U_i^MF(f_i) = Σ_{C: i ∈ C} V_C( f_i, { E^MF[f_k] : k ∈ C } ),
where
  E^MF[f_k] = Σ_{f_k} f_k exp{ -U_k^MF(f_k) } / Z_k^MF.
We replace the neighbors of each site by their (frozen) means.

126 Mean field approximation (cont.). There is a self-referential aspect in the previous equations:
- to obtain U_i^MF(f_i) we need the mean values of its neighbors;
- these, in turn, depend on E^MF[f_i] (since neighborhood relations are symmetrical), and thus on U_i^MF(f_i) itself.
As a consequence, the MF approximation has to be obtained iteratively. Alternative: the mean of each site, E^MF[f_i], is approximated by the mode -- the saddle point approximation.

127 Deterministic optimization: continuation methods. The objective function U(f|g) is embedded in a family
  { U(f|g, λ), λ ∈ [0, 1] }
such that U(f|g, 0) is easily minimizable and U(f|g, 1) = U(f|g). Procedure:
- find the minimum of U(f|g, 0); this is easy;
- track that minimum while λ (slowly) increases up to 1.

128 Deterministic optimization: continuation methods. The "tracking" is usually implemented as follows:
- a discrete set of values { 0 = λ_0 ≤ λ_1 ≤ ... ≤ λ_t ≤ ... ≤ λ_{n-1} ≤ λ_n = 1 } ⊂ [0, 1] is chosen;
- for each λ_t, U(f|g, λ_t) is minimized by some local iterative technique;
- this iterative process is initialized at the minimum previously obtained for λ_{t-1}.
Writing T = -log λ reveals that simulated annealing shares some of the spirit of continuation algorithms; simulated annealing can be called a "stochastic continuation method".

129 Continuation methods: mean field annealing (MFA). MFA is a deterministic surrogate of (stochastic) simulated annealing: p(f|g, T) is replaced by its MF approximation. Computing the MF approximation means finding the MF values; the fact that these must be obtained iteratively is exploited to insert their computation into a continuation method.

130 Continuation methods: mean field annealing (MFA).
- For T → ∞, p(f|g, T) and its MF approximation are uniform; the mean field is trivially obtainable.
- At (finite) temperature T_t, the mean field values E_t^MF[f_k | g, T_t] are obtained iteratively; this iterative process is initialized at the previous mean field values E_{t-1}^MF[f_k | g, T_{t-1}].
- As T(t) → 0, the MF approximation converges to a distribution concentrated on its global maxima.
- Alternatively, the temperature descent is stopped at T = 1. This yields a MF approximation of p(f|g, T = 1) = p(f|g), and the mean field values are then (approximate) PM estimates.

131 Continuation methods: simulated tearing (ST). Uses the following family of functions:
  { U(f|g, λ) = U(f | λ g), λ ∈ [0, 1] }.
Obviously, U(f|g, 1) = U(f|g). This method is adequate when U(f|g, 0) is easily minimizable. This is the case for most discontinuity-preserving MRF priors, because for arguments close to zero the potentials have convex behavior. The example shown above, of discontinuity-preserving restoration with CGMRF and MDL priors, uses this continuation method.

132 Important topics not covered:
- Specific discrete-state MRFs: Ising, auto-logistic, auto-binomial, ...
- Multi-scale MRF models.
- Causal MRF models.
- A closer look at applications (see references).

133 Some references (this is not an exhaustive list):
- Fundamental Bayesian theory: see the accompanying text and the many references therein.
- Compound inference. General concepts: [7], [74]. In computer vision/image analysis/pattern recognition: [4], [44], [65]. The multivariate Gaussian case, from a signal processing perspective: [76].
- Random fields on graphs: [41], [42], [71], [79] (and references therein), and [81].
- Markov random fields on regular graphs. Seminal papers: [33], [11]. Earlier work: [82], [75], [85], [86], [9], [52]. Books (good sources for further references): [19], [62], [83], [42]; see also [41]. Influential papers on MRFs for image analysis and computer vision: [11], [18], [20], [21], [23], [24], [25], [26], [33], [49], [61], [65], [68], [85], [86].
- Compound Gauss-Markov random fields and applications: [28], [48], [49], [90].
- Parameter estimation: [6], [55], [28], [39], [50], [56], [64], [67], [84], [91], [94], [5], and further references in [62].


More information

Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ and Center for Automated Learning and

More information

Markov Networks. l Like Bayes Nets. l Graph model that describes joint probability distribution using tables (AKA potentials)

Markov Networks. l Like Bayes Nets. l Graph model that describes joint probability distribution using tables (AKA potentials) Markov Networks l Like Bayes Nets l Graph model that describes joint probability distribution using tables (AKA potentials) l Nodes are random variables l Labels are outcomes over the variables Markov

More information

Markov Random Fields

Markov Random Fields Markov Random Fields 1. Markov property The Markov property of a stochastic sequence {X n } n 0 implies that for all n 1, X n is independent of (X k : k / {n 1, n, n + 1}), given (X n 1, X n+1 ). Another

More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information

3 : Representation of Undirected GM

3 : Representation of Undirected GM 10-708: Probabilistic Graphical Models 10-708, Spring 2016 3 : Representation of Undirected GM Lecturer: Eric P. Xing Scribes: Longqi Cai, Man-Chia Chang 1 MRF vs BN There are two types of graphical models:

More information

CS 630 Basic Probability and Information Theory. Tim Campbell

CS 630 Basic Probability and Information Theory. Tim Campbell CS 630 Basic Probability and Information Theory Tim Campbell 21 January 2003 Probability Theory Probability Theory is the study of how best to predict outcomes of events. An experiment (or trial or event)

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Inversion Base Height. Daggot Pressure Gradient Visibility (miles)

Inversion Base Height. Daggot Pressure Gradient Visibility (miles) Stanford University June 2, 1998 Bayesian Backtting: 1 Bayesian Backtting Trevor Hastie Stanford University Rob Tibshirani University of Toronto Email: trevor@stat.stanford.edu Ftp: stat.stanford.edu:

More information

Bayesian Image Segmentation Using MRF s Combined with Hierarchical Prior Models

Bayesian Image Segmentation Using MRF s Combined with Hierarchical Prior Models Bayesian Image Segmentation Using MRF s Combined with Hierarchical Prior Models Kohta Aoki 1 and Hiroshi Nagahashi 2 1 Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology

More information

PROBABILITY DISTRIBUTIONS. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception

PROBABILITY DISTRIBUTIONS. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception PROBABILITY DISTRIBUTIONS Credits 2 These slides were sourced and/or modified from: Christopher Bishop, Microsoft UK Parametric Distributions 3 Basic building blocks: Need to determine given Representation:

More information

ECE521 Tutorial 11. Topic Review. ECE521 Winter Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides. ECE521 Tutorial 11 / 4

ECE521 Tutorial 11. Topic Review. ECE521 Winter Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides. ECE521 Tutorial 11 / 4 ECE52 Tutorial Topic Review ECE52 Winter 206 Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides ECE52 Tutorial ECE52 Winter 206 Credits to Alireza / 4 Outline K-means, PCA 2 Bayesian

More information

Part 8: GLMs and Hierarchical LMs and GLMs

Part 8: GLMs and Hierarchical LMs and GLMs Part 8: GLMs and Hierarchical LMs and GLMs 1 Example: Song sparrow reproductive success Arcese et al., (1992) provide data on a sample from a population of 52 female song sparrows studied over the course

More information

Intelligent Systems:

Intelligent Systems: Intelligent Systems: Undirected Graphical models (Factor Graphs) (2 lectures) Carsten Rother 15/01/2015 Intelligent Systems: Probabilistic Inference in DGM and UGM Roadmap for next two lectures Definition

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish

More information

Machine Learning using Bayesian Approaches

Machine Learning using Bayesian Approaches Machine Learning using Bayesian Approaches Sargur N. Srihari University at Buffalo, State University of New York 1 Outline 1. Progress in ML and PR 2. Fully Bayesian Approach 1. Probability theory Bayes

More information

3 Undirected Graphical Models

3 Undirected Graphical Models Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 3 Undirected Graphical Models In this lecture, we discuss undirected

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

A graph contains a set of nodes (vertices) connected by links (edges or arcs) BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Introduction to Machine Learning. Lecture 2

Introduction to Machine Learning. Lecture 2 Introduction to Machine Learning Lecturer: Eran Halperin Lecture 2 Fall Semester Scribe: Yishay Mansour Some of the material was not presented in class (and is marked with a side line) and is given for

More information

SCUOLA DI SPECIALIZZAZIONE IN FISICA MEDICA. Sistemi di Elaborazione dell Informazione. Regressione. Ruggero Donida Labati

SCUOLA DI SPECIALIZZAZIONE IN FISICA MEDICA. Sistemi di Elaborazione dell Informazione. Regressione. Ruggero Donida Labati SCUOLA DI SPECIALIZZAZIONE IN FISICA MEDICA Sistemi di Elaborazione dell Informazione Regressione Ruggero Donida Labati Dipartimento di Informatica via Bramante 65, 26013 Crema (CR), Italy http://homes.di.unimi.it/donida

More information

CSC 412 (Lecture 4): Undirected Graphical Models

CSC 412 (Lecture 4): Undirected Graphical Models CSC 412 (Lecture 4): Undirected Graphical Models Raquel Urtasun University of Toronto Feb 2, 2016 R Urtasun (UofT) CSC 412 Feb 2, 2016 1 / 37 Today Undirected Graphical Models: Semantics of the graph:

More information

Outline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution

Outline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution Outline A short review on Bayesian analysis. Binomial, Multinomial, Normal, Beta, Dirichlet Posterior mean, MAP, credible interval, posterior distribution Gibbs sampling Revisit the Gaussian mixture model

More information

Markov Chains and MCMC

Markov Chains and MCMC Markov Chains and MCMC Markov chains Let S = {1, 2,..., N} be a finite set consisting of N states. A Markov chain Y 0, Y 1, Y 2,... is a sequence of random variables, with Y t S for all points in time

More information

Curve Fitting Re-visited, Bishop1.2.5

Curve Fitting Re-visited, Bishop1.2.5 Curve Fitting Re-visited, Bishop1.2.5 Maximum Likelihood Bishop 1.2.5 Model Likelihood differentiation p(t x, w, β) = Maximum Likelihood N N ( t n y(x n, w), β 1). (1.61) n=1 As we did in the case of the

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

Statistics 352: Spatial statistics. Jonathan Taylor. Department of Statistics. Models for discrete data. Stanford University.

Statistics 352: Spatial statistics. Jonathan Taylor. Department of Statistics. Models for discrete data. Stanford University. 352: 352: Models for discrete data April 28, 2009 1 / 33 Models for discrete data 352: Outline Dependent discrete data. Image data (binary). Ising models. Simulation: Gibbs sampling. Denoising. 2 / 33

More information

Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling

Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling Jon Wakefield Departments of Statistics and Biostatistics University of Washington 1 / 37 Lecture Content Motivation

More information

p L yi z n m x N n xi

p L yi z n m x N n xi y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

More information

Ch 4. Linear Models for Classification

Ch 4. Linear Models for Classification Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm 1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable

More information

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007 Gaussian Processes Outline Gaussian Processes Outline Parametric Bayesian Regression Gaussian

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

Independencies. Undirected Graphical Models 2: Independencies. Independencies (Markov networks) Independencies (Bayesian Networks)

Independencies. Undirected Graphical Models 2: Independencies. Independencies (Markov networks) Independencies (Bayesian Networks) (Bayesian Networks) Undirected Graphical Models 2: Use d-separation to read off independencies in a Bayesian network Takes a bit of effort! 1 2 (Markov networks) Use separation to determine independencies

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters

More information

Hierarchical prior models for scalable Bayesian Tomography

Hierarchical prior models for scalable Bayesian Tomography A. Mohammad-Djafari, Hierarchical prior models for scalable Bayesian Tomography, WSTL, January 25-27, 2016, Szeged, Hungary. 1/45. Hierarchical prior models for scalable Bayesian Tomography Ali Mohammad-Djafari

More information

Introduction to Probabilistic Graphical Models

Introduction to Probabilistic Graphical Models Introduction to Probabilistic Graphical Models Sargur Srihari srihari@cedar.buffalo.edu 1 Topics 1. What are probabilistic graphical models (PGMs) 2. Use of PGMs Engineering and AI 3. Directionality in

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida Bayesian Statistical Methods Jeff Gill Department of Political Science, University of Florida 234 Anderson Hall, PO Box 117325, Gainesville, FL 32611-7325 Voice: 352-392-0262x272, Fax: 352-392-8127, Email:

More information

DS-GA 1002 Lecture notes 11 Fall Bayesian statistics

DS-GA 1002 Lecture notes 11 Fall Bayesian statistics DS-GA 100 Lecture notes 11 Fall 016 Bayesian statistics In the frequentist paradigm we model the data as realizations from a distribution that depends on deterministic parameters. In contrast, in Bayesian

More information

1 Undirected Graphical Models. 2 Markov Random Fields (MRFs)

1 Undirected Graphical Models. 2 Markov Random Fields (MRFs) Machine Learning (ML, F16) Lecture#07 (Thursday Nov. 3rd) Lecturer: Byron Boots Undirected Graphical Models 1 Undirected Graphical Models In the previous lecture, we discussed directed graphical models.

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework

More information

Learning Bayesian network : Given structure and completely observed data

Learning Bayesian network : Given structure and completely observed data Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution

More information

Undirected Graphical Models

Undirected Graphical Models Undirected Graphical Models 1 Conditional Independence Graphs Let G = (V, E) be an undirected graph with vertex set V and edge set E, and let A, B, and C be subsets of vertices. We say that C separates

More information

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj Adriana Ibrahim Institute

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Bayesian Learning. Two Roles for Bayesian Methods. Bayes Theorem. Choosing Hypotheses

Bayesian Learning. Two Roles for Bayesian Methods. Bayes Theorem. Choosing Hypotheses Bayesian Learning Two Roles for Bayesian Methods Probabilistic approach to inference. Quantities of interest are governed by prob. dist. and optimal decisions can be made by reasoning about these prob.

More information

Some Asymptotic Bayesian Inference (background to Chapter 2 of Tanner s book)

Some Asymptotic Bayesian Inference (background to Chapter 2 of Tanner s book) Some Asymptotic Bayesian Inference (background to Chapter 2 of Tanner s book) Principal Topics The approach to certainty with increasing evidence. The approach to consensus for several agents, with increasing

More information

CS 2750: Machine Learning. Bayesian Networks. Prof. Adriana Kovashka University of Pittsburgh March 14, 2016

CS 2750: Machine Learning. Bayesian Networks. Prof. Adriana Kovashka University of Pittsburgh March 14, 2016 CS 2750: Machine Learning Bayesian Networks Prof. Adriana Kovashka University of Pittsburgh March 14, 2016 Plan for today and next week Today and next time: Bayesian networks (Bishop Sec. 8.1) Conditional

More information

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that

More information

Probability and Estimation. Alan Moses

Probability and Estimation. Alan Moses Probability and Estimation Alan Moses Random variables and probability A random variable is like a variable in algebra (e.g., y=e x ), but where at least part of the variability is taken to be stochastic.

More information

Hidden Markov Models and Gaussian Mixture Models

Hidden Markov Models and Gaussian Mixture Models Hidden Markov Models and Gaussian Mixture Models Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 4&5 23&27 January 2014 ASR Lectures 4&5 Hidden Markov Models and Gaussian

More information

Study Notes on the Latent Dirichlet Allocation

Study Notes on the Latent Dirichlet Allocation Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection

More information

Bayesian Paradigm. Maximum A Posteriori Estimation

Bayesian Paradigm. Maximum A Posteriori Estimation Bayesian Paradigm Maximum A Posteriori Estimation Simple acquisition model noise + degradation Constraint minimization or Equivalent formulation Constraint minimization Lagrangian (unconstraint minimization)

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Directed and Undirected Graphical Models

Directed and Undirected Graphical Models Directed and Undirected Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Machine Learning: Neural Networks and Advanced Models (AA2) Last Lecture Refresher Lecture Plan Directed

More information

Classical and Bayesian inference

Classical and Bayesian inference Classical and Bayesian inference AMS 132 Claudia Wehrhahn (UCSC) Classical and Bayesian inference January 8 1 / 11 The Prior Distribution Definition Suppose that one has a statistical model with parameter

More information

CSCE 478/878 Lecture 6: Bayesian Learning

CSCE 478/878 Lecture 6: Bayesian Learning Bayesian Methods Not all hypotheses are created equal (even if they are all consistent with the training data) Outline CSCE 478/878 Lecture 6: Bayesian Learning Stephen D. Scott (Adapted from Tom Mitchell

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 143 Part IV

More information

Advances and Applications in Perfect Sampling

Advances and Applications in Perfect Sampling and Applications in Perfect Sampling Ph.D. Dissertation Defense Ulrike Schneider advisor: Jem Corcoran May 8, 2003 Department of Applied Mathematics University of Colorado Outline Introduction (1) MCMC

More information

Machine Learning Lecture 5

Machine Learning Lecture 5 Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

More information