CSE 559A: Computer Vision

CSE 559A: Computer Vision
Fall 2018: T-R: 11:30-1pm @ Lopata 101
Instructor: Ayan Chakrabarti (ayan@wustl.edu). Course Staff: Zhihao Xia, Charlie Wu, Han Liu
http://www.cse.wustl.edu/~ayan/courses/cse559a/
Sep 13, 2018

ADMINISTRIVIA
Tomorrow: Zhihao's Office Hours back in Jolley 309: 10:30am-Noon
This Friday: Regular Office Hours
Next Friday: Recitation for PSET 1. Try all problems before coming to recitation.
Monday: Office Hours again in Jolley 27

Statistics / Estimation Recap
Setting:
There is a true image (or image-like object) x that we want to estimate.
What we have is some degraded observation y that is based on x.
The degradation might be "stochastic": so we say y ~ p(y|x).
Simplest Case (single pixel: x, y ∈ R): y = x + σϵ, p(y|x) = N(y; μ = x, σ²)
Estimate x from y: "denoising"

Statistics / Estimation Recap
p(y|x) is our observation model, called the likelihood function: the likelihood of observation y given true image x.
Maximum Likelihood Estimator: x̂ = arg max_x p(y|x)
For p(y|x) = N(y; x, σ²), what is the ML estimate?
p(y|x) = (1/√(2πσ²)) exp( −(y − x)²/(2σ²) )  ⟹  x̂ = y
That's a little disappointing.
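
A quick numerical sanity check of this slide (a sketch, not course code): maximize the Gaussian likelihood over a grid of candidate x values and confirm the maximizer is just y. The values of y and σ below are arbitrary.

import numpy as np

# ML estimation under p(y|x) = N(y; x, sigma^2): the likelihood is maximized at x = y.
# y and sigma are arbitrary illustrative values.
y, sigma = 0.7, 0.2
xs = np.linspace(-1.0, 2.0, 300001)
lik = np.exp(-(y - xs)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
print(xs[np.argmax(lik)], y)   # the grid maximizer lands (up to grid spacing) on y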

Statistics / Estimation Recap
What we really want to do is maximize p(x|y), not the likelihood.
p(y|x) is the distribution of observation y given true image x.
p(x|y) is the distribution of true x given observation y.
Bayes Rule: p(x|y) = p(y|x) p(x) / p(y)
Simple Derivation: p(x, y) = p(x) p(y|x) = p(y) p(x|y)  ⟹  p(x|y) = p(y|x) p(x) / p(y), with p(y) = ∫ p(y|x) p(x) dx

Statistics / Estimation Recap
p(x|y) = p(y|x) p(x) / ∫ p(y|x) p(x) dx
p(y|x): Likelihood
p(x): "Prior" distribution on x. What we believe about x before we see the observation.
p(x|y): Posterior distribution of x given observation y. What we believe about x after we see the observation.

Statistics / Estimation Recap
p(x|y) = p(y|x) p(x) / ∫ p(y|x) p(x) dx
Maximum A Posteriori (MAP) Estimation: x̂ = arg max_x p(x|y)
The most likely answer under the posterior distribution.
Other Estimators Possible: Try to minimize some "risk" or loss function L(x̂, x), which measures how bad an answer x̂ is when the true value is x.
Minimize Expected Risk Under Posterior: x̂ = arg min_x̂ ∫ L(x̂, x) p(x|y) dx
Let's say L(x̂, x) = (x̂ − x)². What is x̂?  x̂ = ∫ x p(x|y) dx (the posterior mean)

Statistics / Estimation Recap
p(x|y) = p(y|x) p(x) / ∫ p(y|x) p(x) dx
Maximum A Posteriori (MAP) Estimation: x̂ = arg max_x p(x|y)
How do we compute this? (Turn this into minimization of a cost)
x̂ = arg max_x p(x|y) = arg max_x p(y|x) p(x) / ∫ p(y|x) p(x) dx
  = arg max_x p(y|x) p(x)
  = arg max_x log[ p(y|x) p(x) ]
  = arg max_x log p(y|x) + log p(x)
  = arg min_x −log p(y|x) − log p(x)

Statistics / Estimation Recap
Maximum A Posteriori (MAP) Estimation: Back to Denoising
p(y|x) = (1/√(2πσ²)) exp( −(y − x)²/(2σ²) )
Let's choose a simple prior: p(x) = N(x; 0.5, 1)
x̂ = arg min_x (y − x)²/(2σ²) + C₁ + (x − 0.5)²/2 + C₂   (C₁, C₂ are constants from the normalizers)

Statistics / Estimation Recap
Maximum A Posteriori (MAP) Estimation
x̂ = arg min_x (y − x)²/(2σ²) + (x − 0.5)²/2
How do you compute x̂?
1. In this case, a simpler option is available: complete the squares, express as a single quadratic.
2. More generally, to minimize C(x), find x̂ such that C′(x̂) = 0, and check C″(x̂) > 0.
What is C′(x)?  C′(x) = (x − y)/σ² + (x − 0.5)

Statistics / Estimation Recap
Maximum A Posteriori (MAP) Estimation
x̂ = arg min_x (y − x)²/(2σ²) + (x − 0.5)²/2
C′(x) = (x − y)/σ² + (x − 0.5) = 0  ⟹  x̂ = (y + 0.5σ²) / (1 + σ²)
C″(x) = 1/σ² + 1 > 0
Means the function is convex (quadratic functions with a positive coefficient for x² are convex).

Statistics / Estimation Recap
Maximum A Posteriori (MAP) Estimation
x̂ = arg min_x (y − x)²/(2σ²) + (x − 0.5)²/2 = (y + 0.5σ²) / (1 + σ²)
A weighted sum of observation and prior mean.
Closer to the prior mean when σ² is high.
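
As a sketch (with illustrative values, not from the course), here is the single-pixel MAP estimate in NumPy, checked against a brute-force grid minimization of the same cost:

import numpy as np

# Single-pixel MAP denoising with likelihood N(y; x, sigma^2) and prior N(x; 0.5, 1).
# y and sigma are arbitrary illustrative values.
y, sigma = 0.9, 0.5

# Closed form from setting C'(x) = 0.
x_map = (y + 0.5 * sigma**2) / (1.0 + sigma**2)

# Sanity check: minimize the cost on a dense grid.
xs = np.linspace(-2.0, 3.0, 100001)
cost = (y - xs)**2 / (2 * sigma**2) + (xs - 0.5)**2 / 2
print(x_map, xs[np.argmin(cost)])   # should agree to about the grid resolution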

Statistics / Estimation Recap
Let's go beyond "single pixel images": Y[n] = X[n] + ϵ[n]
If noise is independent at each pixel: p({Y[n]} | {X[n]}) = ∏_n p(Y[n] | X[n]) = ∏_n N(Y[n]; X[n], σ²)
Similarly, if the prior p({X[n]}) is defined to model pixels independently: p({X[n]}) = ∏_n N(X[n]; 0.5, 1)
The product turns into a sum after taking log.

Statistics / Estimation Recap
{X̂[n]} = arg min_{X[n]} ∑_n [ (Y[n] − X[n])²/(2σ²) + (X[n] − 0.5)²/2 ]
Note: This is a minimization over multiple variables.
But since the different variables don't interact with each other, we can optimize each pixel separately:
X̂[n] = (Y[n] + 0.5σ²) / (1 + σ²)
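
Since the objective separates per pixel, the whole image can be denoised with one vectorized expression. A minimal NumPy sketch, using a synthetic image and noise level purely for illustration:

import numpy as np

# Per-pixel MAP denoiser applied to a whole image (prior mean 0.5, prior variance 1).
rng = np.random.default_rng(0)
X = rng.random((64, 64))                      # stand-in "true" image in [0, 1)
sigma = 0.2
Y = X + sigma * rng.standard_normal(X.shape)  # noisy observation

# Elementwise weighted sum of observation and prior mean.
X_hat = (Y + 0.5 * sigma**2) / (1.0 + sigma**2)

print(np.mean((Y - X)**2), np.mean((X_hat - X)**2))  # MSE before vs. after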

Multi-variate Gaussians
Re-interpret these as multi-variate Gaussians. For X ∈ R^d:
p(X) = (1/√((2π)^d det(Σ))) exp( −(1/2) (X − μ)^T Σ⁻¹ (X − μ) )
Here, μ ∈ R^d is a mean "vector", same size as X.
The co-variance Σ is a symmetric, positive-definite d×d matrix.
When the different elements of X are independent, the off-diagonal elements of Σ are 0.
How do we represent our prior and likelihood as a multi-variate Gaussian?

Multi-variate Gaussians
Represent X and Y as vectorized images: p(Y|X) = N(Y; X, σ²I)
Here the mean-vector for Y is X.
The co-variance matrix is a diagonal matrix with all diagonal entries = σ².

Multi-variate Gaussians
μ = X, Σ = σ²I, det(Σ) = (σ²)^d, Σ⁻¹ = (1/σ²) I
p(Y|X) = (1/√((2π)^d det(Σ))) exp( −(1/2) (Y − μ)^T Σ⁻¹ (Y − μ) )
(Y − μ)^T Σ⁻¹ (Y − μ) = (Y − X)^T (1/σ²) I (Y − X) = (1/σ²) (Y − X)^T (Y − X) = ‖Y − X‖²/σ² = (1/σ²) ∑_n (Y[n] − X[n])²
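
A small numerical check of this reduction (a sketch with arbitrary sizes and values): the multivariate Gaussian log-density with Σ = σ²I equals the sum of independent per-pixel 1-D Gaussian log-densities.

import numpy as np

rng = np.random.default_rng(1)
d, sigma = 16, 0.3
X = rng.random(d)
Y = X + sigma * rng.standard_normal(d)

# Multivariate form: -(1/2)(Y-X)^T Sigma^{-1} (Y-X) - (1/2) log((2 pi)^d det(Sigma))
quad = (Y - X) @ (Y - X) / sigma**2
log_mvn = -0.5 * quad - 0.5 * d * np.log(2 * np.pi * sigma**2)

# Product of per-pixel Gaussians turns into a sum after taking log.
log_indep = np.sum(-0.5 * (Y - X)**2 / sigma**2 - 0.5 * np.log(2 * np.pi * sigma**2))

print(np.allclose(log_mvn, log_indep))  # True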

So far: p(Y|X) = N(Y; X, σ²I), p(X) = N(X; 0.5·1, I) (with 1 the all-ones vector, i.e., mean 0.5 at every pixel), giving
X̂ = arg min_X ‖Y − X‖²/(2σ²) + ‖X − 0.5·1‖²/2
But now, we can use these multi-variate distributions to model correlations and interactions between different pixels.
Let's stay with denoising, but talk about defining a prior in the Wavelet domain.
Instead of saying individual pixel values are independent, let's say individual wavelet coefficients are independent.
Let's also put different means and variances on different wavelet coefficients:
All the 'derivative' coefficients have zero mean.
Variance goes up as we go to coarser levels.

Let C = WX, where W represents the Wavelet transform matrix.
W is unitary: W^T W = W W^T = I, so X = W^T C.
Define our prior on C: p(C) = N(C; μ_c, D_c)
Here, D_c is a diagonal matrix, with entries equal to the corresponding variances of the coefficients.
Off-diagonal elements are 0: this implies the wavelet co-efficients are un-correlated.
A prior on C implies a prior on X: C is just a different representation for X.

Change of Variables
We have a probability distribution p_U(U) on a random variable U. Realize that U = f(V), where f is a one-to-one function.
Is p_V(V) = p_U(f(V))?  No: p_U and p_V are densities, so we need to account for "scaling" of the probability measure.
p_U(U) dU = p_U(f(V)) dU, where dU = |det(J_f)| dV and (J_f)_ij = ∂u_i/∂v_j
So, p_V(V) = p_U(f(V)) |det(J_f)|
If U = f(V) = AV is linear, J_f = A. If A is unitary, |det(A)| = 1.
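
To make the linear/unitary case concrete, here is a sketch that builds a small orthonormal Haar matrix (a stand-in for the course's wavelet transform W; the construction and size are my own choices) and checks that W W^T = I and |det(W)| = 1:

import numpy as np

# Orthonormal Haar transform matrix, built recursively; d must be a power of 2.
def haar_matrix(d):
    W = np.array([[1.0]])
    while W.shape[0] < d:
        n = W.shape[0]
        top = np.kron(W, [1.0, 1.0])            # coarse (scaling) rows
        bot = np.kron(np.eye(n), [1.0, -1.0])   # fine 'derivative' (detail) rows
        W = np.vstack([top, bot]) / np.sqrt(2.0)
    return W

W = haar_matrix(8)
print(np.allclose(W @ W.T, np.eye(8)))   # unitary: W W^T = I
print(np.abs(np.linalg.det(W)))          # |det(W)| = 1, so densities aren't rescaled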

Let C = WX, where W represents the Wavelet transform matrix. Now, let's denoise with this prior.
p(C) = N(C; μ_c, D_c)
p(X) = (1/√((2π)^d det(D_c))) exp( −(1/2) (WX − μ_c)^T D_c⁻¹ (WX − μ_c) )
     = (1/√((2π)^d det(D_c))) exp( −(1/2) (X − W^T μ_c)^T W^T D_c⁻¹ W (X − W^T μ_c) )
     = N(X; W^T μ_c, W^T D_c W)

X̂ = arg min_X ‖Y − X‖² + (X − μ)^T Σ⁻¹ (X − μ), where μ = W^T μ_c, Σ = W^T D_c W. (We're assuming the noise variance is 1.)
How do we minimize this?
Take the derivative, set to 0. Check the second derivative is positive.
Take the gradient (vector derivative), set to 0. Check that the "Hessian" is positive-definite.
y = f(x) is a scalar valued function of a vector x ∈ R^d.
The gradient ∇_x y is a vector the same size as x, with each entry being the partial derivative of y with respect to that element of x: ∇_x y = [ ∂y/∂x_1, ∂y/∂x_2, ∂y/∂x_3, … ]^T

y = f(x) is a scalar valued function of a vector x ∈ R^d.
The gradient ∇_x y is a vector the same size as x, with each entry being the partial derivative of y with respect to that element of x: ∇_x y = [ ∂y/∂x_1, ∂y/∂x_2, ∂y/∂x_3, … ]^T
The Hessian is a d×d matrix for x ∈ R^d: (H_y)_ij = ∂²y / (∂x_i ∂x_j)

Properties of a multi-variate quadratic form: f(x) = x^T Q x − 2 x^T R + S
where Q is a symmetric d×d matrix, R is a d-dimensional vector, S is a scalar. Note that this is a scalar value.
∇_x f = 2Qx − 2R
Comes from the following identities: ∇_x (x^T A x) = (A + A^T) x, and ∇_x (x^T R) = ∇_x (R^T x) = R
The Hessian is simply 2Q (it is constant, doesn't depend on x), which is positive definite exactly when Q is.
If Q is positive definite, then ∇_x f = 0 gives us the unique minimizer: 2Qx − 2R = 0  ⟹  x̂ = Q⁻¹ R
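
A sketch verifying these identities numerically, with a random symmetric positive-definite Q (sizes and values are arbitrary):

import numpy as np

rng = np.random.default_rng(2)
d = 5
A = rng.standard_normal((d, d))
Q = A @ A.T + d * np.eye(d)          # symmetric positive definite
R = rng.standard_normal(d)
S = 1.7

f = lambda x: x @ Q @ x - 2 * x @ R + S
grad = lambda x: 2 * Q @ x - 2 * R

x_star = np.linalg.solve(Q, R)       # x_hat = Q^{-1} R
print(np.allclose(grad(x_star), 0.0))              # gradient vanishes at x_hat
print(all(f(x_star) <= f(x_star + 0.1 * rng.standard_normal(d))
          for _ in range(100)))                    # nearby points are no better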

X̂ = arg min_X ‖Y − X‖² + (X − μ)^T Σ⁻¹ (X − μ), where μ = W^T μ_c, Σ = W^T D_c W
  = arg min_X X^T (I + Σ⁻¹) X − 2 X^T (Y + Σ⁻¹ μ) + ( ‖Y‖² + μ^T Σ⁻¹ μ )
Q = (I + Σ⁻¹) is positive definite (the sum of two positive-definite matrices is positive-definite).
X̂ = (I + Σ⁻¹)⁻¹ (Y + Σ⁻¹ μ)
Σ⁻¹ = (W^T D_c W)⁻¹ = W⁻¹ D_c⁻¹ (W^T)⁻¹ = W^T D_c⁻¹ W, i.e., taking the inverse in the original de-correlated wavelet domain.
Σ⁻¹ μ = W^T D_c⁻¹ W μ = W^T D_c⁻¹ W W^T μ_c = W^T D_c⁻¹ μ_c, i.e., the inverse-wavelet transform of the scaled wavelet coefficients.

X̂ = arg min_X X^T (I + Σ⁻¹) X − 2 X^T (Y + Σ⁻¹ μ) + ( ‖Y‖² + μ^T Σ⁻¹ μ ) = (I + Σ⁻¹)⁻¹ (Y + Σ⁻¹ μ)
Σ⁻¹ = W^T D_c⁻¹ W, Σ⁻¹ μ = W^T D_c⁻¹ μ_c
Note that we can't actually form I + Σ⁻¹, let alone invert it: it is a d×d matrix, with d = total number of pixels.
I + Σ⁻¹ = I + W^T D_c⁻¹ W = W^T I W + W^T D_c⁻¹ W = W^T (I + D_c⁻¹) W
(I + Σ⁻¹)⁻¹ = W^T (I + D_c⁻¹)⁻¹ W
I + D_c⁻¹ is also diagonal, so its inverse is just inverting the diagonal elements.
X̂ = W^T (I + D_c⁻¹)⁻¹ W (Y + W^T D_c⁻¹ μ_c)
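
A sketch of this wavelet-domain denoiser on a short 1-D signal, reusing the orthonormal Haar matrix from the earlier sketch; the prior parameters μ_c, D_c and the signal are illustrative assumptions, not course values:

import numpy as np

def haar_matrix(d):                   # same construction as in the earlier sketch
    W = np.array([[1.0]])
    while W.shape[0] < d:
        n = W.shape[0]
        W = np.vstack([np.kron(W, [1.0, 1.0]),
                       np.kron(np.eye(n), [1.0, -1.0])]) / np.sqrt(2.0)
    return W

d = 8
W = haar_matrix(d)
rng = np.random.default_rng(3)

mu_c = np.zeros(d); mu_c[0] = 0.5 * np.sqrt(d)   # prior mean: a constant 0.5 signal
D_c = np.diag(np.linspace(1.0, 0.1, d))          # larger variance for coarser coefficients
Y = rng.random(d)                                 # noisy observation (noise variance 1)

# X_hat = W^T (I + D_c^{-1})^{-1} W (Y + W^T D_c^{-1} mu_c); the middle inverse is
# diagonal, so for real images it would just be an elementwise division.
Dc_inv = np.diag(1.0 / np.diag(D_c))
X_hat = W.T @ np.linalg.inv(np.eye(d) + Dc_inv) @ W @ (Y + W.T @ Dc_inv @ mu_c)

# Agrees with the dense form (I + Sigma^{-1})^{-1} (Y + Sigma^{-1} mu) that we
# could not afford to build for a full-sized image.
Sigma_inv = W.T @ Dc_inv @ W
X_dense = np.linalg.solve(np.eye(d) + Sigma_inv, Y + Sigma_inv @ (W.T @ mu_c))
print(np.allclose(X_hat, X_dense))   # True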

X̂ = arg min_X X^T (I + Σ⁻¹) X − 2 X^T (Y + Σ⁻¹ μ) + ( ‖Y‖² + μ^T Σ⁻¹ μ )
X̂ = W^T (I + D_c⁻¹)⁻¹ W (Y + W^T D_c⁻¹ μ_c)
W X̂ = W W^T (I + D_c⁻¹)⁻¹ (W Y + W W^T D_c⁻¹ μ_c)
Let's call Ĉ = W X̂, C_y = W Y. Then Ĉ = (I + D_c⁻¹)⁻¹ (C_y + D_c⁻¹ μ_c)
So what we did is that we "denoised" in the wavelet domain, with the noise variance in the wavelet domain also being equal to 1:
Y = X + ϵ  ⟹  C_y = W Y = C + W ϵ, and W ϵ is also a random vector with 0 mean and covariance W I W^T = I.

So far, we've been in a Bayesian setting with a prior:
X̂ = arg min_X −log p(Y|X) − log p(X)
We're balancing off fidelity to the observation (first term, likelihood), and what we believe about X (second term, prior).
But we need p(X) to be a 'proper' probability distribution. Sometimes, that's too restrictive.
Instead, just think of these as a "data term" (which still comes from the observation model), and a generic "regularization" term:
X̂ = arg min_X D(X; Y) + R(X)
X̂ = arg min_X ‖Y − X‖² + λ ( ‖G_x X‖² + ‖G_y X‖² )
Here, G_x X is a vector corresponding to the image we get by convolving X with the Gaussian x-derivative filter.

X̂ = arg min_X ‖Y − X‖² + λ ( ‖G_x X‖² + ‖G_y X‖² )
The data term still has the same form as for denoising.
But now, the regularization just says we want the x and y spatial derivatives of our image to be small.
Encodes a prior that natural images are smooth. This isn't a prior, because gradients aren't a "representation" of X.
But in some ways better: unlike the wavelet prior, it doesn't enforce smoothness at only alternate gradient locations.
How do we minimize this?
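
One simple answer, sketched below: the objective is a smooth quadratic, so plain gradient descent works. As simplifying assumptions, the derivative filters here are forward differences rather than Gaussian derivatives, and λ, the step size, and the test image are arbitrary illustrative choices.

import numpy as np

# Gradient descent on E(X) = ||Y - X||^2 + lam * (||dx(X)||^2 + ||dy(X)||^2).
rng = np.random.default_rng(4)
Y = rng.random((32, 32))
lam = 1.0

def dx(X): return X[:, 1:] - X[:, :-1]            # horizontal forward differences
def dy(X): return X[1:, :] - X[:-1, :]            # vertical forward differences

def dxT(D, shape):                                # adjoint (transpose) of dx
    G = np.zeros(shape); G[:, 1:] += D; G[:, :-1] -= D; return G

def dyT(D, shape):                                # adjoint (transpose) of dy
    G = np.zeros(shape); G[1:, :] += D; G[:-1, :] -= D; return G

X = Y.copy()
step = 0.05                                       # small enough for this quadratic
for _ in range(500):
    grad = 2 * (X - Y) + 2 * lam * (dxT(dx(X), X.shape) + dyT(dy(X), X.shape))
    X -= step * grad

E = lambda Z: np.sum((Y - Z)**2) + lam * (np.sum(dx(Z)**2) + np.sum(dy(Z)**2))
print(E(Y), E(X))                                 # energy drops from the initialization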