CSE 559A: Computer Vision, Fall 2018: T-R @ Lopata
Instructor: Ayan Chakrabarti (ayan@wustl.edu). Course Staff: Zhihao Xia, Charlie Wu, Han Liu
http://www.cse.wustl.edu/~ayan/courses/cse559a/
Sep 3, 2018
ADMINISTRIVIA
Tomorrow: Zhihao's office hours back in Jolley 309, 10:30am-Noon
This Friday: Regular office hours
Next Friday: Recitation for PSET. Try all problems before coming to recitation.
Monday: Office hours again in Jolley 27
Statistics / Estimation Recap
Setting: There is a true image (or image-like object) $x$ that we want to estimate. What we have is some degraded observation $y$ that is based on $x$. The degradation might be "stochastic": so we say $p(y|x)$.
Simplest case (single pixel: $x, y \in \mathbb{R}$): $y = x + \sigma\epsilon$, i.e., $p(y|x) = \mathcal{N}(y;\, \mu = x,\, \sigma^2)$.
Estimate $x$ from $y$: "denoising".
Statistics / Estimation Recap
$p(y|x)$ is our observation model, called the likelihood function: the likelihood of observation $y$ given true image $x$.
Maximum Likelihood Estimator: $\hat{x} = \arg\max_x p(y|x)$
For $p(y|x) = \mathcal{N}(y;\, x,\, \sigma^2)$, what is the ML estimate?
$p(y|x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y-x)^2}{2\sigma^2}\right) \quad\Rightarrow\quad \hat{x} = y$
That's a little disappointing.
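Filling in the maximization step (my addition, not on the slide): maximize the log-likelihood by setting its derivative to zero.
$\log p(y|x) = -\frac{(y-x)^2}{2\sigma^2} - \log\left(\sqrt{2\pi}\,\sigma\right)$, so $\frac{\partial}{\partial x}\log p(y|x) = \frac{y-x}{\sigma^2} = 0 \;\Rightarrow\; \hat{x} = y$.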
Statistics / Estimation Recap
What we really want to do is maximize $p(x|y)$, not the likelihood.
$p(y|x)$ is the distribution of observation $y$ given true image $x$; $p(x|y)$ is the distribution of true $x$ given observation $y$.
Bayes Rule: $p(x|y) = \frac{p(y|x)\,p(x)}{p(y)}$
Simple derivation: $p(x, y) = p(x)\,p(y|x) = p(y)\,p(x|y)$, so $p(x|y) = \frac{p(y|x)\,p(x)}{p(y)}$, where $p(y) = \int p(y|x)\,p(x)\,dx$.
Statistics / Estimation Recap
$p(x|y) = \frac{p(y|x)\,p(x)}{\int p(y|x)\,p(x)\,dx}$
$p(y|x)$: Likelihood.
$p(x)$: "Prior" distribution on $x$. What we believe about $x$ before we see the observation.
$p(x|y)$: Posterior distribution of $x$ given observation $y$. What we believe about $x$ after we see the observation.
Statistics / Estimation Recap
$p(x|y) = \frac{p(y|x)\,p(x)}{\int p(y|x)\,p(x)\,dx}$
Maximum A Posteriori (MAP) Estimation: $\hat{x} = \arg\max_x p(x|y)$, the most likely answer under the posterior distribution.
Other estimators possible: try to minimize some "risk" or loss function $L(\hat{x}, x)$ that measures how bad an answer $\hat{x}$ is when the true value is $x$.
Minimize expected risk under posterior: $\hat{x} = \arg\min_{\hat{x}} \int L(\hat{x}, x)\,p(x|y)\,dx$
Let's say $L(\hat{x}, x) = (\hat{x} - x)^2$. What is $\hat{x}$? $\hat{x} = \int x\,p(x|y)\,dx$, the posterior mean.
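Filling in the standard answer to this question (my addition): differentiate the expected risk with respect to $\hat{x}$ and use $\int p(x|y)\,dx = 1$.
$\frac{\partial}{\partial \hat{x}} \int (\hat{x}-x)^2\,p(x|y)\,dx = \int 2(\hat{x}-x)\,p(x|y)\,dx = 2\hat{x} - 2\int x\,p(x|y)\,dx = 0 \;\Rightarrow\; \hat{x} = \int x\,p(x|y)\,dx$.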
Statistics / Estimation Recap
Maximum A Posteriori (MAP) Estimation: $\hat{x} = \arg\max_x p(x|y)$. How do we compute this?
$\hat{x} = \arg\max_x p(x|y) = \arg\max_x \frac{p(y|x)\,p(x)}{\int p(y|x)\,p(x)\,dx}$
The denominator doesn't depend on $x$, so (turning this into minimization of a cost):
$\hat{x} = \arg\max_x p(y|x)\,p(x) = \arg\max_x \log[p(y|x)\,p(x)] = \arg\max_x \log p(y|x) + \log p(x) = \arg\min_x -\log p(y|x) - \log p(x)$
Statistics / Estimation Recap
Maximum A Posteriori (MAP) Estimation. Back to denoising:
$p(y|x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y-x)^2}{2\sigma^2}\right)$
Let's choose a simple prior: $p(x) = \mathcal{N}(x;\, 0.5,\, 1)$
$\hat{x} = \arg\min_x \frac{(y-x)^2}{2\sigma^2} + C_1 + \frac{(x-0.5)^2}{2} + C_2$
(the constants $C_1, C_2$ don't depend on $x$)
Statistics / Estimation Recap
Maximum A Posteriori (MAP) Estimation: $\hat{x} = \arg\min_x \frac{(y-x)^2}{2\sigma^2} + \frac{(x-0.5)^2}{2}$
How do you compute $\hat{x}$?
1. In this case, a simpler option is available: complete the squares, express as a single quadratic.
2. More generally, to minimize $C(x)$, find $C'(x) = 0$, check $C''(x) > 0$.
What is $C'(x)$? $C'(x) = -\frac{y-x}{\sigma^2} + (x - 0.5)$
Statistics / Estimation Recap
Maximum A Posteriori (MAP) Estimation: $\hat{x} = \arg\min_x \frac{(y-x)^2}{2\sigma^2} + \frac{(x-0.5)^2}{2}$
$C'(x) = -\frac{y-x}{\sigma^2} + (x - 0.5) = 0 \;\Rightarrow\; \hat{x} = \frac{y + 0.5\,\sigma^2}{1 + \sigma^2}$
$C''(x) = \frac{1}{\sigma^2} + 1 > 0$
Means the function is convex (quadratic functions with a positive coefficient for $x^2$ are convex).
Statistics / Estimation Recap
Maximum A Posteriori (MAP) Estimation: $\hat{x} = \arg\min_x \frac{(y-x)^2}{2\sigma^2} + \frac{(x-0.5)^2}{2}$
$\hat{x} = \frac{y + 0.5\,\sigma^2}{1 + \sigma^2}$
Weighted sum of the observation and the prior mean: closer to the prior mean when $\sigma^2$ is high.
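A minimal numerical sketch of this single-pixel MAP estimate (my own illustration, not course code; the `prior_var` argument is a small generalization that reduces to the slide's formula when it equals 1):

```python
def map_denoise_pixel(y, sigma2, prior_mean=0.5, prior_var=1.0):
    """MAP estimate for y = x + noise with noise variance sigma2 and prior N(prior_mean, prior_var)."""
    # Closed form: a weighted average of the observation and the prior mean.
    return (prior_var * y + sigma2 * prior_mean) / (prior_var + sigma2)

y = 0.9
for sigma2 in [0.01, 1.0, 100.0]:
    print(sigma2, map_denoise_pixel(y, sigma2))
# As sigma2 grows, the estimate moves from the observation y toward the prior mean 0.5.
```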
Statistics / Estimation Recap
Let's go beyond "single pixel images": $Y[n] = X[n] + \epsilon[n]$
If noise is independent at each pixel: $p(\{Y[n]\} \mid \{X[n]\}) = \prod_n p(Y[n] \mid X[n]) = \prod_n \mathcal{N}(Y[n];\, X[n],\, \sigma^2)$
Similarly, if the prior $p(\{X[n]\})$ is defined to model pixels independently: $p(\{X[n]\}) = \prod_n \mathcal{N}(X[n];\, 0.5,\, 1)$
The product turns into a sum after taking the log.
Statistics / Estimation Recap
$\{\hat{X}[n]\} = \arg\min_{\{X[n]\}} \sum_n \frac{(Y[n] - X[n])^2}{2\sigma^2} + \frac{(X[n] - 0.5)^2}{2}$
Note: this is a minimization over multiple variables. But since the different variables don't interact with each other, we can optimize each pixel separately:
$\hat{X}[n] = \frac{Y[n] + 0.5\,\sigma^2}{1 + \sigma^2}$
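Because the estimate decouples across pixels, the same closed form applies elementwise to a whole image; a short numpy sketch (my own illustration, with a made-up constant image and noise level):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.full((64, 64), 0.6)                       # "true" image
sigma2 = 0.25
Y = X + np.sqrt(sigma2) * rng.standard_normal(X.shape)

# Per-pixel MAP estimate under the independent N(0.5, 1) prior, applied elementwise.
X_hat = (Y + 0.5 * sigma2) / (1.0 + sigma2)
print(np.mean((Y - X) ** 2), np.mean((X_hat - X) ** 2))   # the MAP estimate has lower error here
```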
Multi-variate Gaussians
Re-interpret these as multi-variate Gaussians. For $x \in \mathbb{R}^d$:
$p(x) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$
Here, $\mu \in \mathbb{R}^d$ is a mean "vector", same size as $x$. The co-variance $\Sigma$ is a symmetric, positive-definite $d \times d$ matrix.
When the different elements of $x$ are independent, the off-diagonal elements of $\Sigma$ are 0.
How do we represent our prior and likelihood as multi-variate Gaussians?
Multi-variate Gaussians
Represent $X$ and $Y$ as vectorized images: $p(Y|X) = \mathcal{N}(Y;\, X,\, \sigma^2 I)$
Here the mean-vector for $Y$ is $X$. The co-variance matrix is a diagonal matrix with all diagonal entries $= \sigma^2$.
Multi-variate Gaussians
$\mu = X$, $\Sigma = \sigma^2 I$, so $\det(\Sigma) = (\sigma^2)^d$ and $\Sigma^{-1} = \frac{1}{\sigma^2} I$
$p(Y|X) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} \exp\left( -\frac{1}{2} (Y - \mu)^T \Sigma^{-1} (Y - \mu) \right)$
$(Y - \mu)^T \Sigma^{-1} (Y - \mu) = (Y - X)^T \frac{1}{\sigma^2} I\, (Y - X) = \frac{1}{\sigma^2} (Y - X)^T (Y - X) = \frac{\|Y - X\|^2}{\sigma^2} = \frac{1}{\sigma^2}\sum_n (Y[n] - X[n])^2$
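A quick numerical check of this reduction (my own sketch; it just verifies that the multi-variate log-density with covariance $\sigma^2 I$ equals the sum of per-pixel Gaussian log-densities):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(0)
d, sigma2 = 5, 0.3
X = rng.random(d)                                   # "true" image, vectorized
Y = X + np.sqrt(sigma2) * rng.standard_normal(d)    # noisy observation

# Multi-variate Gaussian with mean X and covariance sigma^2 * I ...
log_mv = multivariate_normal.logpdf(Y, mean=X, cov=sigma2 * np.eye(d))
# ... equals the sum of independent per-pixel Gaussian log-densities.
log_indep = norm.logpdf(Y, loc=X, scale=np.sqrt(sigma2)).sum()
print(np.isclose(log_mv, log_indep))                # True
```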
So far (independent pixels): $p(Y|X) = \mathcal{N}(Y;\, X,\, \sigma^2 I)$, $p(X) = \mathcal{N}(X;\, 0.5\cdot\mathbf{1},\, I)$, $\hat{X} = \arg\min_X \frac{\|Y - X\|^2}{2\sigma^2} + \frac{\|X - 0.5\cdot\mathbf{1}\|^2}{2}$
But now, we can use these multi-variate distributions to model correlations and interactions between different pixels.
Let's stay with denoising, but talk about defining a prior in the Wavelet domain.
Instead of saying individual pixel values are independent, let's say individual wavelet coefficients are independent.
Let's also put different means and variances on different wavelet coefficients:
All the 'derivative' coefficients have zero mean.
Variance goes up as we go to coarser levels.
Let $C = WX$, where $W$ represents the Wavelet transform matrix. $W$ is unitary: $W^T W = W W^T = I$, so $X = W^T C$.
Define our prior on $C$: $p(C) = \mathcal{N}(C;\, \mu_c,\, D_c)$
Here, $D_c$ is a diagonal matrix, with entries equal to the corresponding variances of the coefficients. Off-diagonal elements are 0, which implies the wavelet co-efficients are un-correlated.
A prior on $C$ implies a prior on $X$: $C$ is just a different representation for $X$.
Change of Variables
We have a probability distribution $p_U(U)$ on a random variable $U$. Realize that $U = f(V)$, where $f$ is a one-to-one function.
Is $p_V(V) = p_U(f(V))$? No: $p_U, p_V$ are densities, so we need to account for "scaling" of the probability measure.
$p_U(U)\,dU = p_U(f(V))\,du$, and $du = |\det(J_f)|\,dv$, where $(J_f)_{ij} = \frac{du_i}{dv_j}$
So, $p_V(V) = p_U(f(V))\,|\det(J_f)|$
If $U = f(V) = AV$ is linear, $J_f = A$. If $A$ is unitary, $|\det(A)| = 1$.
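A one-line sanity check of the scalar case (my addition): for $U = f(V) = aV$ with $U \sim \mathcal{N}(0, 1)$, the formula gives $p_V(v) = p_U(av)\,|a| = \frac{|a|}{\sqrt{2\pi}} \exp\left(-\frac{a^2 v^2}{2}\right)$, which is exactly the density of $V = U/a \sim \mathcal{N}(0, 1/a^2)$.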
Let $C = WX$, where $W$ represents the Wavelet transform matrix, and $p(C) = \mathcal{N}(C;\, \mu_c,\, D_c)$. Now, let's denoise with this prior.
$p(X) = \frac{1}{\sqrt{(2\pi)^d \det(D_c)}} \exp\left( -\frac{1}{2} (WX - \mu_c)^T D_c^{-1} (WX - \mu_c) \right)$
$\phantom{p(X)} = \frac{1}{\sqrt{(2\pi)^d \det(D_c)}} \exp\left( -\frac{1}{2} (X - W^T\mu_c)^T\, W^T D_c^{-1} W\, (X - W^T\mu_c) \right)$
$\phantom{p(X)} = \mathcal{N}(X;\, W^T\mu_c,\, W^T D_c W)$
$\hat{X} = \arg\min_X \|Y - X\|^2 + (X - \mu)^T \Sigma^{-1} (X - \mu)$, where $\mu = W^T\mu_c$, $\Sigma = W^T D_c W$. (We're assuming the noise variance is 1.)
How do we minimize this? Take the derivative, set it to 0, check the second derivative is positive. Here: take the gradient (vector derivative), set it to 0, check that the "Hessian" is positive-definite.
$y = f(x)$ is a scalar-valued function of a vector $x \in \mathbb{R}^d$. The gradient is a vector of the same size as $x$, with each entry being the partial derivative of $y$ with respect to that element of $x$:
$\nabla_x y = \left[ \frac{\partial y}{\partial x_1},\, \frac{\partial y}{\partial x_2},\, \frac{\partial y}{\partial x_3},\, \ldots \right]^T$
$y = f(x)$ is a scalar-valued function of a vector $x \in \mathbb{R}^d$. The gradient is a vector of the same size as $x$, with each entry being the partial derivative of $y$ with respect to that element of $x$:
$\nabla_x y = \left[ \frac{\partial y}{\partial x_1},\, \frac{\partial y}{\partial x_2},\, \frac{\partial y}{\partial x_3},\, \ldots \right]^T$
The Hessian is a matrix of size $d \times d$ for $x \in \mathbb{R}^d$: $(H_y)_{ij} = \frac{\partial^2 y}{\partial x_i \partial x_j}$
Properties of a multi-variate quadratic form: $f(x) = x^T Q x - 2 x^T R + S$, where $Q$ is a symmetric $d \times d$ matrix, $R$ is a $d$-dimensional vector, $S$ is a scalar. Note that this is a scalar value.
$\nabla_x f = 2Qx - 2R$
Comes from the following identities: $\nabla_x (x^T A x) = (A + A^T)x$, and $\nabla_x (x^T R) = \nabla_x (R^T x) = R$
The Hessian is simply $2Q$ (it is constant, doesn't depend on $x$).
If $Q$ is positive definite, then $\nabla_x f = 0$ gives us the unique minimizer: $2Qx - 2R = 0 \Rightarrow \hat{x} = Q^{-1}R$
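A small numpy check of these identities (my own sketch): build a random positive-definite $Q$, then compare the closed-form minimizer $Q^{-1}R$ against a finite-difference gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))
Q = A @ A.T + d * np.eye(d)            # symmetric, positive-definite
R = rng.standard_normal(d)
S = 1.7

f = lambda x: x @ Q @ x - 2 * x @ R + S

# Setting the gradient 2Qx - 2R to zero gives the unique minimizer Q^{-1} R.
x_star = np.linalg.solve(Q, R)

# Finite-difference gradient at x_star should (approximately) vanish.
eps = 1e-6
grad = np.array([(f(x_star + eps * e) - f(x_star - eps * e)) / (2 * eps) for e in np.eye(d)])
print(np.allclose(grad, 0, atol=1e-4))                         # True
print(f(x_star) < f(x_star + 0.1 * rng.standard_normal(d)))    # True: any perturbation increases f
```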
$\hat{X} = \arg\min_X \|Y - X\|^2 + (X - \mu)^T \Sigma^{-1} (X - \mu)$, where $\mu = W^T\mu_c$, $\Sigma = W^T D_c W$.
$\phantom{\hat{X}} = \arg\min_X X^T (I + \Sigma^{-1}) X - 2 X^T (Y + \Sigma^{-1}\mu) + \left( \|Y\|^2 + \mu^T \Sigma^{-1} \mu \right)$
$Q = (I + \Sigma^{-1})$ is positive definite (the sum of two positive-definite matrices is positive-definite), so
$\hat{X} = (I + \Sigma^{-1})^{-1} (Y + \Sigma^{-1}\mu)$
$\Sigma^{-1} = (W^T D_c W)^{-1} = W^{-1} D_c^{-1} (W^T)^{-1} = W^T D_c^{-1} W$, i.e., taking the inverse in the de-correlated wavelet domain.
$\Sigma^{-1}\mu = W^T D_c^{-1} W \mu = W^T D_c^{-1} W W^T \mu_c = W^T D_c^{-1} \mu_c$, i.e., the inverse-wavelet transform of the scaled wavelet coefficients.
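A quick numerical check of the $\Sigma^{-1} = W^T D_c^{-1} W$ identity (my own sketch, with a random orthonormal matrix standing in for the wavelet transform $W$):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
W, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthonormal stand-in for the wavelet transform
Dc = np.diag(rng.uniform(0.5, 2.0, d))             # diagonal coefficient variances

Sigma = W.T @ Dc @ W
# Inverting Sigma directly vs. inverting only the diagonal in the wavelet domain:
print(np.allclose(np.linalg.inv(Sigma), W.T @ np.diag(1.0 / np.diag(Dc)) @ W))   # True
```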
$\hat{X} = \arg\min_X X^T (I + \Sigma^{-1}) X - 2 X^T (Y + \Sigma^{-1}\mu) + \left( \|Y\|^2 + \mu^T \Sigma^{-1} \mu \right)$
$\hat{X} = (I + \Sigma^{-1})^{-1} (Y + \Sigma^{-1}\mu)$, with $\Sigma^{-1} = W^T D_c^{-1} W$ and $\Sigma^{-1}\mu = W^T D_c^{-1}\mu_c$
Note that we can't actually form $I + \Sigma^{-1}$, let alone invert it: it's a $d \times d$ matrix, with $d$ = total number of pixels. But:
$I + \Sigma^{-1} = I + W^T D_c^{-1} W = W^T W + W^T D_c^{-1} W = W^T I W + W^T D_c^{-1} W = W^T (I + D_c^{-1}) W$
$(I + \Sigma^{-1})^{-1} = W^T (I + D_c^{-1})^{-1} W$
$I + D_c^{-1}$ is also diagonal, so its inverse is just inverting the diagonal elements.
$\hat{X} = W^T (I + D_c^{-1})^{-1} W \left( Y + W^T D_c^{-1} \mu_c \right)$
Let's call $C_y = WY$, $\hat{C} = W\hat{X}$.
$\hat{X} = W^T (I + D_c^{-1})^{-1} W \left( Y + W^T D_c^{-1} \mu_c \right)$
$W\hat{X} = W W^T (I + D_c^{-1})^{-1} \left( WY + W W^T D_c^{-1} \mu_c \right)$
$\hat{C} = (I + D_c^{-1})^{-1} \left( C_y + D_c^{-1} \mu_c \right)$
So what we did is "denoise" in the wavelet domain: $Y = X + \epsilon \Rightarrow C_y = WY = C + W\epsilon$, and $W\epsilon$ is also a random vector with 0 mean and covariance $W I W^T = I$, so the noise variance in the wavelet domain is also equal to 1.
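A small numerical sketch of this wavelet-domain denoiser (my own illustration, not course code): a single-level orthonormal 2D Haar transform stands in for $W$, the prior mean is the transform of a flat 0.5 image, and the per-coefficient variances are made up, small for the detail subbands and large for the coarse ones.

```python
import numpy as np

def haar2(X):
    """Single-level orthonormal 2D Haar transform (plays the role of W on a vectorized image)."""
    for axis in (0, 1):
        a = np.take(X, range(0, X.shape[axis], 2), axis=axis)
        b = np.take(X, range(1, X.shape[axis], 2), axis=axis)
        X = np.concatenate([(a + b) / np.sqrt(2), (a - b) / np.sqrt(2)], axis=axis)
    return X

def ihaar2(C):
    """Inverse transform (plays the role of W^T)."""
    for axis in (0, 1):
        n = C.shape[axis] // 2
        s = np.take(C, range(n), axis=axis)
        d = np.take(C, range(n, 2 * n), axis=axis)
        out = np.empty_like(C)
        ev = [slice(None)] * 2; ev[axis] = slice(0, None, 2)
        od = [slice(None)] * 2; od[axis] = slice(1, None, 2)
        out[tuple(ev)] = (s + d) / np.sqrt(2)
        out[tuple(od)] = (s - d) / np.sqrt(2)
        C = out
    return C

rng = np.random.default_rng(0)
N = 64
X = np.full((N, N), 0.6); X[:, N // 2:] = 0.3      # piecewise-constant "image"
Y = X + rng.standard_normal(X.shape)               # noise variance = 1, as in the slides

mu_c = haar2(np.full((N, N), 0.5))                 # assumed prior mean in the wavelet domain
Dc = np.full((N, N), 0.1); Dc[:N // 2, :N // 2] = 10.0   # made-up variances: detail small, coarse large

Cy = haar2(Y)
C_hat = (Cy + mu_c / Dc) / (1.0 + 1.0 / Dc)        # (I + Dc^{-1})^{-1} (Cy + Dc^{-1} mu_c), elementwise
X_hat = ihaar2(C_hat)
print(np.mean((Y - X) ** 2), np.mean((X_hat - X) ** 2))   # denoising reduces the error here
```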
So far, we've been in a Bayesian setting with a prior: $\hat{X} = \arg\min_X -\log p(Y|X) - \log p(X)$
We're balancing off fidelity to the observation (first term, likelihood) and what we believe about $X$ (second term, prior).
But we need $p(X)$ to be a 'proper' probability distribution. Sometimes, that's too restrictive.
Instead, just think of these as a "data term" (which still comes from the observation model) and a generic "regularization" term: $\hat{X} = \arg\min_X D(X; Y) + R(X)$
$\hat{X} = \arg\min_X \|Y - X\|^2 + \lambda \left( \|G_x X\|^2 + \|G_y X\|^2 \right)$
Here, $G_x X$ is a vector corresponding to the image we get by convolving $X$ with the Gaussian $x$-derivative filter (and similarly for $G_y X$).
$\hat{X} = \arg\min_X \|Y - X\|^2 + \lambda \left( \|G_x X\|^2 + \|G_y X\|^2 \right)$
The data term still has the same form as for denoising. But now, the regularization just says we want the $x$ and $y$ spatial derivatives of our image to be small. This encodes a prior that natural images are smooth.
This isn't a prior, because gradients aren't a "representation" of $X$. But better, because they don't enforce smoothness at alternate gradient locations.
How do we minimize this?
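The slide leaves this question open; as one illustrative answer (not necessarily the approach the course takes next), note that the objective is exactly the quadratic form from a few slides back, so for a small image it can be solved in closed form. In this sketch, simple forward-difference matrices stand in for the Gaussian derivative filters $G_x, G_y$, and all function names and parameter values are my own assumptions:

```python
import numpy as np

def diff_matrix(h, w, axis):
    """Matrix that maps a vectorized h-by-w image to its forward differences along `axis`
    (a crude stand-in for convolution with a Gaussian derivative filter)."""
    d = h * w
    G = np.zeros((d, d))
    idx = np.arange(d).reshape(h, w)
    src = (idx[:, :-1] if axis == 1 else idx[:-1, :]).ravel()
    dst = (idx[:, 1:] if axis == 1 else idx[1:, :]).ravel()
    G[src, dst] = 1.0
    G[src, src] = -1.0
    return G

def smooth_denoise(Y, lam=2.0):
    """Closed-form minimizer of ||Y - X||^2 + lam (||Gx X||^2 + ||Gy X||^2):
    the gradient is 2(X - Y) + 2 lam (Gx^T Gx + Gy^T Gy) X,
    so X = (I + lam (Gx^T Gx + Gy^T Gy))^{-1} Y."""
    h, w = Y.shape
    Gx, Gy = diff_matrix(h, w, 1), diff_matrix(h, w, 0)
    A = np.eye(h * w) + lam * (Gx.T @ Gx + Gy.T @ Gy)
    return np.linalg.solve(A, Y.ravel()).reshape(h, w)

rng = np.random.default_rng(0)
X = np.full((16, 16), 0.6); X[:, 8:] = 0.3
Y = X + 0.3 * rng.standard_normal(X.shape)
X_hat = smooth_denoise(Y)
print(np.mean((Y - X) ** 2), np.mean((X_hat - X) ** 2))
```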