CSE 559A: Computer Vision

ADMINISTRIVIA

Tomorrow: Zhihao's Office Hours back in Jolley 309, 10:30am-Noon
Fall 2018: T-R @ Lopata 101
This Friday: Regular Office Hours
Next Friday: Recitation for the PSET. Try all problems before coming to recitation.
Monday: Office Hours again in Jolley
Instructor: Ayan Chakrabarti (ayan@wustl.edu). Course Staff: Zhihao Xia, Charlie Wu, Han Liu
http://www.cse.wustl.edu/~ayan/courses/cse559a/
Sep 13, 2018

Setting

There is a true image (or image-like object) x.
What we have is some degraded observation y that is based on x, and we want to estimate x from it.
The degradation might be "stochastic": so we say y ~ p(y|x).

Simplest case (single pixel): x in R, y in R. Estimate x from y: "denoising",

  y = x + σ ε,  ε ~ N(0, 1),  so  p(y|x) = N(y; μ = x, σ²) = (1 / sqrt(2πσ²)) exp( −(y − x)² / (2σ²) )

p(y|x) is our observation model, called the likelihood function: the likelihood of the observation y given the true image x.

Maximum Likelihood Estimator: x̂ = arg max_x p(y|x).
For p(y|x) = N(y; x, σ²), what is the ML estimate? x̂ = y. That's a little disappointing.
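As a quick sanity check (my own sketch, not from the slides; the true value and noise level below are illustrative), here is the single-pixel observation model in numpy, with the ML estimate found by maximizing the Gaussian log-likelihood over a grid. It lands on x̂ = y, as the slide says.

```python
import numpy as np

rng = np.random.default_rng(0)
x_true, sigma = 0.7, 0.3                       # illustrative true pixel value and noise level
y = x_true + sigma * rng.standard_normal()     # one noisy observation: y = x + sigma * eps

# Log-likelihood log p(y|x) over a grid of candidate x (constants dropped).
x_grid = np.linspace(-2.0, 3.0, 10001)
log_lik = -0.5 * (y - x_grid) ** 2 / sigma ** 2

x_ml = x_grid[np.argmax(log_lik)]
print(y, x_ml)     # the ML estimate is just the observation itself, x_hat = y
```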

What we really want to do is maximize p(x|y), not the likelihood p(y|x).

Bayes Rule

p(y|x) is the distribution of the observation y given the true image x.
p(x|y) is the distribution of the true x given the observation y.

Simple derivation: p(x, y) = p(x) p(y|x) = p(y) p(x|y), so

  p(x|y) = p(y|x) p(x) / p(y) = p(y|x) p(x) / ∫ p(y|x) p(x) dx

p(y|x): the Likelihood.
p(x): the "Prior" distribution on x, i.e., what we believe about x before we see the observation.
p(x|y): the Posterior distribution of x given the observation y, i.e., what we believe about x after we see the observation.

Maximum A Posteriori (MAP) Estimation

  x̂ = arg max_x p(x|y)

The most likely answer under the posterior distribution.

Other estimators are possible: try to minimize some "risk" or loss function L(x̂, x), which measures how bad an answer x̂ is when the true value is x. Minimize the expected risk under the posterior:

  x̂ = arg min_x̂ ∫ L(x̂, x) p(x|y) dx

Let's say L(x̂, x) = (x̂ − x)². What is x̂? The posterior mean: x̂ = ∫ x p(x|y) dx.

How do we compute the MAP estimate?

  x̂ = arg max_x p(x|y)
    = arg max_x p(y|x) p(x) / ∫ p(y|x) p(x) dx
    = arg max_x p(y|x) p(x)              (the denominator doesn't depend on x)
    = arg max_x log[ p(y|x) p(x) ]
    = arg max_x log p(y|x) + log p(x)
    = arg min_x −log p(y|x) − log p(x)   (turn this into minimization of a cost)
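To make the two estimators concrete, here is a small numpy sketch (not from the slides; the noise level, observed y, and grid range are illustrative) that evaluates the single-pixel posterior on a dense grid and reads off both the MAP estimate (argmax of the posterior) and the minimizer of expected squared-error risk (the posterior mean).

```python
import numpy as np

# Single-pixel model from the slides: y = x + sigma*eps, prior x ~ N(0.5, 1).
sigma, y = 0.5, 1.3                           # illustrative noise level and observation

x = np.linspace(-4.0, 5.0, 20001)             # dense grid of candidate x values
dx = x[1] - x[0]
log_lik   = -0.5 * (y - x) ** 2 / sigma ** 2  # log p(y|x), constants dropped
log_prior = -0.5 * (x - 0.5) ** 2             # log p(x),   constants dropped
post = np.exp(log_lik + log_prior)
post /= post.sum() * dx                       # normalize to get p(x|y) on the grid

x_map  = x[np.argmax(post)]                   # MAP estimate: argmax of the posterior
x_mmse = (x * post).sum() * dx                # posterior mean: minimizes expected squared error

print(x_map, x_mmse)   # for a Gaussian likelihood and Gaussian prior these coincide
```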

Maximum A Posteriori (MAP) Estimation: Back to Denoising

  p(y|x) = (1 / sqrt(2πσ²)) exp( −(y − x)² / (2σ²) )

Let's choose a simple prior: p(x) = N(x; 0.5, 1). Then

  x̂ = arg min_x −log p(y|x) − log p(x) = arg min_x (y − x)² / (2σ²) + (x − 0.5)² / 2 + const

How do you compute x̂? In this case, a simpler option is available: complete the squares and express the cost as a single quadratic in x. More generally, to minimize C(x), find x̂ with C'(x̂) = 0 and check that C''(x̂) > 0.

What is C'(x)?

  C'(x) = −(y − x) / σ² + (x − 0.5)

Setting C'(x̂) = 0:

  −(y − x̂) / σ² + (x̂ − 0.5) = 0   ⟹   x̂ = (y + 0.5 σ²) / (1 + σ²)

  C''(x) = 1/σ² + 1 > 0

which means the function is convex (quadratic functions with a positive coefficient on x² are convex).

The estimate is a weighted sum of the observation and the prior mean: closer to the prior mean when σ² is high.
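A quick check of the closed form (my own sketch, with illustrative values, not part of the lecture): compare the formula above against a generic 1-D numerical minimizer of the same cost.

```python
import numpy as np
from scipy.optimize import minimize_scalar

sigma, y = 0.5, 1.3                                    # illustrative values

def cost(x):                                           # C(x) = -log p(y|x) - log p(x) + const
    return (y - x) ** 2 / (2 * sigma ** 2) + (x - 0.5) ** 2 / 2

x_closed  = (y + 0.5 * sigma ** 2) / (1 + sigma ** 2)  # closed form from C'(x) = 0
x_numeric = minimize_scalar(cost).x                    # generic 1-D minimizer, for comparison
print(x_closed, x_numeric)                             # agree to numerical precision
```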

Let's go beyond "single piel images": Y [n] = [n] + ϵ[n] If noise is independent at each piel p({y[n]} {[n]} = p(y[n] [n] = (Y [n]; [n], σ n n Similarly, if prior p({[n]} is defined to model piels independently: Product turns into sum a er taking log p({[n]} = ([n]; 0.5, n (Y[n] [n] ([n] 0.5 [n] = arg min = + {[n]} Note: This is a minimization over multiple variables n But since the different variables don't interact with each other, we can optimize each piel separately Y[n] + 0.5 [n] = n + Multi-variate Gaussians Re-interpret these as multi-variate Gaussians For R d : (π d det(σ p( = ep ( ( μ ( μ Here, μ R d is a mean "vector", same size as Co-variance Σ is a symmetric, positive-definite matri d d matri When the different elements of are independent, off-diagonal elements of Σ are 0. T Σ Multi-variate Gaussians Represent and Y as vectorized images Here the mean-vector for Y is p(y = (Y ;, σ I The co-variance matri is a diagonal matri with all diagonal entries = σ How do we represent our prior and likelihood as a multi-variate Gaussian?

Multi-variate Gaussians

For the likelihood: μ = X, Σ = σ² I, det(Σ) = (σ²)^d, Σ⁻¹ = σ⁻² I, and

  p(Y|X) = 1 / ( (2π)^{d/2} det(Σ)^{1/2} ) exp( −(1/2) (Y − μ)^T Σ⁻¹ (Y − μ) )

  (Y − μ)^T Σ⁻¹ (Y − μ) = (Y − X)^T (σ⁻² I) (Y − X) = (1/σ²) (Y − X)^T (Y − X) = (1/σ²) ‖Y − X‖² = (1/σ²) Σ_n (Y[n] − X[n])²

But now we can use these multi-variate distributions to model correlations and interactions between different pixels.

Let's stay with denoising, but talk about defining a prior in the Wavelet domain. Instead of saying that individual pixel values are independent, let's say that individual wavelet coefficients are independent. Let's also put different means and variances on different wavelet coefficients: all the 'derivative' coefficients have zero mean, and the variance goes up as we go to coarser levels.

With the per-pixel prior we had p(Y|X) = N(Y; X, σ² I) and p(X) = N(X; 0.5·1, I), which recovers X̂[n] = (Y[n] + 0.5 σ²) / (1 + σ²).

Let C = W X, where W represents the Wavelet transform matrix. W is unitary: W⁻¹ = W^T, so X = W^T C. Define our prior on C:

  p(C) = N(C; μ_c, D_c)

Here, D_c is a diagonal matrix, with entries equal to the corresponding variances of the coefficients. Off-diagonal elements being 0 implies the wavelet co-efficients are un-correlated. A prior on C implies a prior on X, since C is just a different representation for X.

Change of Variables

We have a probability distribution p_U(U) on a random variable U, and U = f(V) is a one-to-one function. Is p_V(V) = p_U(f(V))? Realize that the p(·) are densities, so we need to account for the "scaling" of the probability measure:

  p_U(U) dU = p_U(f(V)) |det(J_f)| dV,   where (J_f)_{ij} = dU_i / dV_j

So, p_V(V) = p_U(f(V)) |det(J_f)|. If U = f(V) = A V is linear, then J_f = A. If A is unitary, |det(A)| = 1.
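A small numpy check of this change-of-variables fact (my own sketch, not from the slides): a random orthogonal matrix stands in for the unitary wavelet transform W, and we verify that a diagonal Gaussian on C = Wx induces the Gaussian N(W^T μ_c, W^T D_c W) on x with no extra Jacobian factor.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d = 16

# Random orthogonal matrix as a stand-in for the unitary wavelet transform W.
W, _ = np.linalg.qr(rng.standard_normal((d, d)))
print(abs(np.linalg.det(W)))                 # ~1.0, so the Jacobian factor |det W| is 1

# Prior on coefficients C = W x: independent Gaussians, N(mu_c, D_c) with diagonal D_c.
mu_c = rng.standard_normal(d)
D_c  = np.diag(rng.uniform(0.5, 2.0, d))

# Induced prior on x: N(W^T mu_c, W^T D_c W). Change of variables adds no extra factor.
x   = rng.standard_normal(d)
p_x = multivariate_normal(W.T @ mu_c, W.T @ D_c @ W).pdf(x)
p_c = multivariate_normal(mu_c, D_c).pdf(W @ x)      # density of C evaluated at C = W x
print(p_x, p_c)                                      # equal: p_X(x) = p_C(Wx) * |det W|
```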

Let C = W X, where W represents the Wavelet transform matrix. Now, let's denoise with this prior: p(C) = N(C; μ_c, D_c), i.e.,

  p(X) = 1 / ( (2π)^{d/2} det(D_c)^{1/2} ) exp( −(1/2) (W X − μ_c)^T D_c⁻¹ (W X − μ_c) )
       = 1 / ( (2π)^{d/2} det(D_c)^{1/2} ) exp( −(1/2) (X − W^T μ_c)^T W^T D_c⁻¹ W (X − W^T μ_c) )
       = N(X; W^T μ_c, W^T D_c W)

where μ = W^T μ_c and Σ = W^T D_c W. (We're assuming the noise variance is 1.) Then

  X̂ = arg min_X ‖Y − X‖² + (X − μ)^T Σ⁻¹ (X − μ)

How do we minimize this? In the scalar case: take the derivative, set it to 0, and check that the second derivative is positive. In the vector case: take the gradient (vector derivative), set it to 0, and check that the "Hessian" is positive-definite.

y = f(x) is a scalar-valued function of a vector x in R^d. The gradient is a vector the same size as x, with each entry being the partial derivative of y with respect to that element of x:

  ∇y = [ ∂y/∂x_1, ∂y/∂x_2, ..., ∂y/∂x_d ]^T

The Hessian is a matrix of size d × d for x in R^d:

  (H_y)_{ij} = ∂²y / (∂x_i ∂x_j)

Properties of a multi-variate quadratic form

  f(x) = x^T Q x − 2 x^T R + S

where Q is a symmetric d × d matrix, R is a d-dimensional vector, and S is a scalar. Note that f(x) is a scalar value.

  ∇f = 2 Q x − 2 R

This comes from the following identities: ∇_x (x^T A x) = (A + A^T) x, ∇_x (x^T R) = R, ∇_x (R^T x) = R.

The Hessian is simply given by 2Q (it is constant, and doesn't depend on x). If Q is positive definite, then ∇f = 0 gives us the unique minimizer:

  2 Q x − 2 R = 0   ⟹   Q x = R   ⟹   x̂ = Q⁻¹ R
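As a numerical companion (my own sketch; the dimension and random values are illustrative), the snippet below checks the gradient of the quadratic form against finite differences and confirms that solving Qx = R gives a stationary point.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

# Symmetric positive-definite Q, vector R, scalar S (illustrative values).
A = rng.standard_normal((d, d))
Q = A @ A.T + d * np.eye(d)
R = rng.standard_normal(d)
S = 3.0

f    = lambda x: x @ Q @ x - 2 * R @ x + S     # f(x) = x^T Q x - 2 x^T R + S
grad = lambda x: 2 * Q @ x - 2 * R             # gradient from the identities above

# Finite-difference check of the gradient at a random point.
x0, eps = rng.standard_normal(d), 1e-6
num_grad = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps) for e in np.eye(d)])
print(np.max(np.abs(num_grad - grad(x0))))     # ~0

# Minimizer: solve Q x = R (never form Q^{-1} explicitly in practice).
x_hat = np.linalg.solve(Q, R)
print(np.max(np.abs(grad(x_hat))))             # ~0 at the minimizer
```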

where μ = W^T μ_c and Σ = W^T D_c W:

  X̂ = arg min_X ‖Y − X‖² + (X − μ)^T Σ⁻¹ (X − μ)
    = arg min_X X^T (I + Σ⁻¹) X − 2 X^T (Y + Σ⁻¹ μ) + ( ‖Y‖² + μ^T Σ⁻¹ μ )

Q = (I + Σ⁻¹) is positive definite (the sum of two positive-definite matrices is positive-definite), so

  X̂ = (I + Σ⁻¹)⁻¹ (Y + Σ⁻¹ μ)

  Σ⁻¹ = (W^T D_c W)⁻¹ = W⁻¹ D_c⁻¹ (W^T)⁻¹ = W^T D_c⁻¹ W      (i.e., taking the inverse in the original de-correlated wavelet domain)
  Σ⁻¹ μ = W^T D_c⁻¹ W μ = W^T D_c⁻¹ W W^T μ_c = W^T D_c⁻¹ μ_c      (i.e., the inverse wavelet transform of the scaled wavelet coefficients)

Note that we can't actually form I + Σ⁻¹, let alone invert it: it is a d × d matrix, with d = total number of pixels. But:

  I + Σ⁻¹ = I + W^T D_c⁻¹ W = W^T W + W^T D_c⁻¹ W = W^T I W + W^T D_c⁻¹ W = W^T (I + D_c⁻¹) W
  (I + Σ⁻¹)⁻¹ = W^T (I + D_c⁻¹)⁻¹ W

I + D_c⁻¹ is also diagonal, so its inverse is just inverting the diagonal elements. Therefore

  X̂ = W^T (I + D_c⁻¹)⁻¹ W (Y + W^T D_c⁻¹ μ_c)

Let's call Ĉ = W X̂ and C_y = W Y. Then

  Ĉ = W X̂ = W W^T (I + D_c⁻¹)⁻¹ (W Y + W W^T D_c⁻¹ μ_c) = (I + D_c⁻¹)⁻¹ (C_y + D_c⁻¹ μ_c)

So what we did is that we "denoised" in the wavelet domain, with the noise variance in the wavelet domain also being equal to 1: if Y = X + ε, then C_y = C + W ε, and W ε is also a random vector with 0 mean and covariance W I W^T = I.

So far, we've been in a Bayesian setting with a prior:

  X̂ = arg min_X −log p(Y|X) − log p(X)

We're balancing off fidelity to the observation (first term, the likelihood) and what we believe about X (second term, the prior). But we need p(X) to be a 'proper' probability distribution, and sometimes that's too restrictive. Instead, just think of these as a "data term" (which still comes from the observation model) and a generic "regularization" term:

  X̂ = arg min_X D(X; Y) + R(X) = arg min_X ‖Y − X‖² + λ ( ‖G_x X‖² + ‖G_y X‖² )

Here, G_x X is a vector corresponding to the image we get by convolving X with the Gaussian x-derivative filter (and similarly G_y X with the y-derivative filter).
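A numpy sketch of the wavelet-domain shortcut (not from the slides; a random orthogonal matrix stands in for W, and the sizes and prior parameters are illustrative): the diagonal update Ĉ = (I + D_c⁻¹)⁻¹ (C_y + D_c⁻¹ μ_c) is checked against a direct solve of (I + Σ⁻¹) X̂ = Y + Σ⁻¹ μ, which is only feasible here because the "image" is tiny.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64                                            # tiny "image" so the direct solve is feasible

# Stand-in for the unitary wavelet transform: a random orthogonal matrix.
W, _ = np.linalg.qr(rng.standard_normal((d, d)))
mu_c  = rng.standard_normal(d)                    # coefficient prior means
var_c = rng.uniform(0.5, 4.0, d)                  # coefficient prior variances (diagonal of D_c)

Y = rng.standard_normal(d)                        # a noisy observation (noise variance 1)

# Efficient path: denoise in the transform domain with purely diagonal operations.
C_y   = W @ Y
C_hat = (C_y + mu_c / var_c) / (1.0 + 1.0 / var_c)    # (I + D_c^{-1})^{-1} (C_y + D_c^{-1} mu_c)
X_hat = W.T @ C_hat

# Reference path: form Sigma^{-1} = W^T D_c^{-1} W explicitly and solve.
Sigma_inv = W.T @ np.diag(1.0 / var_c) @ W
mu = W.T @ mu_c
X_ref = np.linalg.solve(np.eye(d) + Sigma_inv, Y + Sigma_inv @ mu)

print(np.max(np.abs(X_hat - X_ref)))              # ~0: both compute the same MAP estimate
```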

The data term still has the same form as for denoising. But now the regularization just says that we want the x and y spatial derivatives of our image to be small: it encodes a prior that natural images are smooth.

This isn't a prior, because the gradients aren't a "representation" of X. But it can be better, because it doesn't enforce smoothness only at alternate gradient locations.

How do we minimize this?

  X̂ = arg min_X ‖Y − X‖² + λ ( ‖G_x X‖² + ‖G_y X‖² )
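The slides leave the question open; one simple option is plain gradient descent on this quadratic cost. The sketch below is my own illustration, not the lecture's method: it uses forward-difference filters with circular boundaries as a stand-in for the Gaussian derivative filters G_x, G_y, and the step size, weight, and iteration count are illustrative.

```python
import numpy as np

def dx(X):  return np.roll(X, -1, axis=1) - X      # forward difference in x (circular boundary)
def dy(X):  return np.roll(X, -1, axis=0) - X      # forward difference in y
def dxT(V): return np.roll(V,  1, axis=1) - V      # adjoint (transpose) of dx
def dyT(V): return np.roll(V,  1, axis=0) - V      # adjoint of dy

def smooth_denoise(Y, lam=2.0, step=0.02, iters=1000):
    """Minimize ||Y - X||^2 + lam * (||Gx X||^2 + ||Gy X||^2) by gradient descent,
    with Gx, Gy approximated by plain finite-difference filters."""
    X = Y.copy()
    for _ in range(iters):
        grad = 2 * (X - Y) + 2 * lam * (dxT(dx(X)) + dyT(dy(X)))
        X -= step * grad
    return X

# Illustrative usage on a random "image".
rng = np.random.default_rng(3)
Y = rng.standard_normal((64, 64))
X_hat = smooth_denoise(Y)
```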