CSE 559A: Computer Vision

ADMINISTRIVIA

Tomorrow: Zhihao's Office Hours back in Jolley 309, 10:30am-Noon
Fall 2018: T-R @ Lopata 101
This Friday: Regular Office Hours
Next Friday: Recitation for the PSET. Try all problems before coming to recitation.
Monday: Office Hours again in Jolley
Instructor: Ayan Chakrabarti (ayan@wustl.edu). Course Staff: Zhihao Xia, Charlie Wu, Han Liu
http://www.cse.wustl.edu/~ayan/courses/cse559a/
Sep 13, 2018

Setting

There is a true image (or image-like object) x.
What we have is some degraded observation y that is based on x, and we want to estimate x from it.
The degradation might be "stochastic": so we say y ~ p(y|x).

Simplest case (single pixel): x in R, y in R. Estimate x from y: "denoising",

  y = x + σ ε,  ε ~ N(0, 1),  so  p(y|x) = N(y; μ = x, σ²) = (1 / sqrt(2πσ²)) exp( −(y − x)² / (2σ²) )

p(y|x) is our observation model, called the likelihood function: the likelihood of the observation y given the true image x.

Maximum Likelihood Estimator: x̂ = arg max_x p(y|x).
For p(y|x) = N(y; x, σ²), what is the ML estimate? x̂ = y. That's a little disappointing.
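As a quick sanity check (my own sketch, not from the slides; the true value and noise level below are illustrative), here is the single-pixel observation model in numpy, with the ML estimate found by maximizing the Gaussian log-likelihood over a grid. It lands on x̂ = y, as the slide says.

```python
import numpy as np

rng = np.random.default_rng(0)
x_true, sigma = 0.7, 0.3                       # illustrative true pixel value and noise level
y = x_true + sigma * rng.standard_normal()     # one noisy observation: y = x + sigma * eps

# Log-likelihood log p(y|x) over a grid of candidate x (constants dropped).
x_grid = np.linspace(-2.0, 3.0, 10001)
log_lik = -0.5 * (y - x_grid) ** 2 / sigma ** 2

x_ml = x_grid[np.argmax(log_lik)]
print(y, x_ml)     # the ML estimate is just the observation itself, x_hat = y
```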

What we really want to do is maximize p(x|y), not the likelihood p(y|x).

Bayes Rule

p(y|x) is the distribution of the observation y given the true image x.
p(x|y) is the distribution of the true x given the observation y.

Simple derivation: p(x, y) = p(x) p(y|x) = p(y) p(x|y), so

  p(x|y) = p(y|x) p(x) / p(y) = p(y|x) p(x) / ∫ p(y|x) p(x) dx

p(y|x): the Likelihood.
p(x): the "Prior" distribution on x, i.e., what we believe about x before we see the observation.
p(x|y): the Posterior distribution of x given the observation y, i.e., what we believe about x after we see the observation.

Maximum A Posteriori (MAP) Estimation

  x̂ = arg max_x p(x|y)

The most likely answer under the posterior distribution.

Other estimators are possible: try to minimize some "risk" or loss function L(x̂, x), which measures how bad an answer x̂ is when the true value is x. Minimize the expected risk under the posterior:

  x̂ = arg min_x̂ ∫ L(x̂, x) p(x|y) dx

Let's say L(x̂, x) = (x̂ − x)². What is x̂? The posterior mean: x̂ = ∫ x p(x|y) dx.

How do we compute the MAP estimate?

  x̂ = arg max_x p(x|y)
    = arg max_x p(y|x) p(x) / ∫ p(y|x) p(x) dx
    = arg max_x p(y|x) p(x)              (the denominator doesn't depend on x)
    = arg max_x log[ p(y|x) p(x) ]
    = arg max_x log p(y|x) + log p(x)
    = arg min_x −log p(y|x) − log p(x)   (turn this into minimization of a cost)
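To make the two estimators concrete, here is a small numpy sketch (not from the slides; the noise level, observed y, and grid range are illustrative) that evaluates the single-pixel posterior on a dense grid and reads off both the MAP estimate (argmax of the posterior) and the minimizer of expected squared-error risk (the posterior mean).

```python
import numpy as np

# Single-pixel model from the slides: y = x + sigma*eps, prior x ~ N(0.5, 1).
sigma, y = 0.5, 1.3                           # illustrative noise level and observation

x = np.linspace(-4.0, 5.0, 20001)             # dense grid of candidate x values
dx = x[1] - x[0]
log_lik   = -0.5 * (y - x) ** 2 / sigma ** 2  # log p(y|x), constants dropped
log_prior = -0.5 * (x - 0.5) ** 2             # log p(x),   constants dropped
post = np.exp(log_lik + log_prior)
post /= post.sum() * dx                       # normalize to get p(x|y) on the grid

x_map  = x[np.argmax(post)]                   # MAP estimate: argmax of the posterior
x_mmse = (x * post).sum() * dx                # posterior mean: minimizes expected squared error

print(x_map, x_mmse)   # for a Gaussian likelihood and Gaussian prior these coincide
```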

Maximum A Posteriori (MAP) Estimation: Back to Denoising

  p(y|x) = (1 / sqrt(2πσ²)) exp( −(y − x)² / (2σ²) )

Let's choose a simple prior: p(x) = N(x; 0.5, 1). Then

  x̂ = arg min_x −log p(y|x) − log p(x) = arg min_x (y − x)² / (2σ²) + (x − 0.5)² / 2 + const

How do you compute x̂? In this case, a simpler option is available: complete the squares and express the cost as a single quadratic in x. More generally, to minimize C(x), find x̂ with C'(x̂) = 0 and check that C''(x̂) > 0.

What is C'(x)?

  C'(x) = −(y − x) / σ² + (x − 0.5)

Setting C'(x̂) = 0:

  −(y − x̂) / σ² + (x̂ − 0.5) = 0   ⟹   x̂ = (y + 0.5 σ²) / (1 + σ²)

  C''(x) = 1/σ² + 1 > 0

which means the function is convex (quadratic functions with a positive coefficient on x² are convex).

The estimate is a weighted sum of the observation and the prior mean: closer to the prior mean when σ² is high.
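A quick check of the closed form (my own sketch, with illustrative values, not part of the lecture): compare the formula above against a generic 1-D numerical minimizer of the same cost.

```python
import numpy as np
from scipy.optimize import minimize_scalar

sigma, y = 0.5, 1.3                                    # illustrative values

def cost(x):                                           # C(x) = -log p(y|x) - log p(x) + const
    return (y - x) ** 2 / (2 * sigma ** 2) + (x - 0.5) ** 2 / 2

x_closed  = (y + 0.5 * sigma ** 2) / (1 + sigma ** 2)  # closed form from C'(x) = 0
x_numeric = minimize_scalar(cost).x                    # generic 1-D minimizer, for comparison
print(x_closed, x_numeric)                             # agree to numerical precision
```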

Let's go beyond "single piel images": Y [n] = [n] + ϵ[n] If noise is independent at each piel p({y[n]} {[n]} = p(y[n] [n] = (Y [n]; [n], σ n n Similarly, if prior p({[n]} is defined to model piels independently: Product turns into sum a er taking log p({[n]} = ([n]; 0.5, n (Y[n] [n] ([n] 0.5 [n] = arg min = + {[n]} Note: This is a minimization over multiple variables n But since the different variables don't interact with each other, we can optimize each piel separately Y[n] + 0.5 [n] = n + Multi-variate Gaussians Re-interpret these as multi-variate Gaussians For R d : (π d det(σ p( = ep ( ( μ ( μ Here, μ R d is a mean "vector", same size as Co-variance Σ is a symmetric, positive-definite matri d d matri When the different elements of are independent, off-diagonal elements of Σ are 0. T Σ Multi-variate Gaussians Represent and Y as vectorized images Here the mean-vector for Y is p(y = (Y ;, σ I The co-variance matri is a diagonal matri with all diagonal entries = σ How do we represent our prior and likelihood as a multi-variate Gaussian?

Multi-variate Gaussians

For the likelihood: μ = X, Σ = σ² I, det(Σ) = (σ²)^d, Σ⁻¹ = σ⁻² I, and

  p(Y|X) = 1 / ( (2π)^{d/2} det(Σ)^{1/2} ) exp( −(1/2) (Y − μ)^T Σ⁻¹ (Y − μ) )

  (Y − μ)^T Σ⁻¹ (Y − μ) = (Y − X)^T (σ⁻² I) (Y − X) = (1/σ²) (Y − X)^T (Y − X) = (1/σ²) ‖Y − X‖² = (1/σ²) Σ_n (Y[n] − X[n])²

But now we can use these multi-variate distributions to model correlations and interactions between different pixels.

Let's stay with denoising, but talk about defining a prior in the Wavelet domain. Instead of saying that individual pixel values are independent, let's say that individual wavelet coefficients are independent. Let's also put different means and variances on different wavelet coefficients: all the 'derivative' coefficients have zero mean, and the variance goes up as we go to coarser levels.

With the per-pixel prior we had p(Y|X) = N(Y; X, σ² I) and p(X) = N(X; 0.5·1, I), which recovers X̂[n] = (Y[n] + 0.5 σ²) / (1 + σ²).

Let C = W X, where W represents the Wavelet transform matrix. W is unitary: W⁻¹ = W^T, so X = W^T C. Define our prior on C:

  p(C) = N(C; μ_c, D_c)

Here, D_c is a diagonal matrix, with entries equal to the corresponding variances of the coefficients. Off-diagonal elements being 0 implies the wavelet co-efficients are un-correlated. A prior on C implies a prior on X, since C is just a different representation for X.

Change of Variables

We have a probability distribution p_U(U) on a random variable U, and U = f(V) is a one-to-one function. Is p_V(V) = p_U(f(V))? Realize that the p(·) are densities, so we need to account for the "scaling" of the probability measure:

  p_U(U) dU = p_U(f(V)) |det(J_f)| dV,   where (J_f)_{ij} = dU_i / dV_j

So, p_V(V) = p_U(f(V)) |det(J_f)|. If U = f(V) = A V is linear, then J_f = A. If A is unitary, |det(A)| = 1.
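A small numpy check of this change-of-variables fact (my own sketch, not from the slides): a random orthogonal matrix stands in for the unitary wavelet transform W, and we verify that a diagonal Gaussian on C = Wx induces the Gaussian N(W^T μ_c, W^T D_c W) on x with no extra Jacobian factor.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d = 16

# Random orthogonal matrix as a stand-in for the unitary wavelet transform W.
W, _ = np.linalg.qr(rng.standard_normal((d, d)))
print(abs(np.linalg.det(W)))                 # ~1.0, so the Jacobian factor |det W| is 1

# Prior on coefficients C = W x: independent Gaussians, N(mu_c, D_c) with diagonal D_c.
mu_c = rng.standard_normal(d)
D_c  = np.diag(rng.uniform(0.5, 2.0, d))

# Induced prior on x: N(W^T mu_c, W^T D_c W). Change of variables adds no extra factor.
x   = rng.standard_normal(d)
p_x = multivariate_normal(W.T @ mu_c, W.T @ D_c @ W).pdf(x)
p_c = multivariate_normal(mu_c, D_c).pdf(W @ x)      # density of C evaluated at C = W x
print(p_x, p_c)                                      # equal: p_X(x) = p_C(Wx) * |det W|
```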

Let C = W X, where W represents the Wavelet transform matrix. Now, let's denoise with this prior: p(C) = N(C; μ_c, D_c), i.e.,

  p(X) = 1 / ( (2π)^{d/2} det(D_c)^{1/2} ) exp( −(1/2) (W X − μ_c)^T D_c⁻¹ (W X − μ_c) )
       = 1 / ( (2π)^{d/2} det(D_c)^{1/2} ) exp( −(1/2) (X − W^T μ_c)^T W^T D_c⁻¹ W (X − W^T μ_c) )
       = N(X; W^T μ_c, W^T D_c W)

where μ = W^T μ_c and Σ = W^T D_c W. (We're assuming the noise variance is 1.) Then

  X̂ = arg min_X ‖Y − X‖² + (X − μ)^T Σ⁻¹ (X − μ)

How do we minimize this? In the scalar case: take the derivative, set it to 0, and check that the second derivative is positive. In the vector case: take the gradient (vector derivative), set it to 0, and check that the "Hessian" is positive-definite.

y = f(x) is a scalar-valued function of a vector x in R^d. The gradient is a vector the same size as x, with each entry being the partial derivative of y with respect to that element of x:

  ∇y = [ ∂y/∂x_1, ∂y/∂x_2, ..., ∂y/∂x_d ]^T

The Hessian is a matrix of size d × d for x in R^d:

  (H_y)_{ij} = ∂²y / (∂x_i ∂x_j)

Properties of a multi-variate quadratic form

  f(x) = x^T Q x − 2 x^T R + S

where Q is a symmetric d × d matrix, R is a d-dimensional vector, and S is a scalar. Note that f(x) is a scalar value.

  ∇f = 2 Q x − 2 R

This comes from the following identities: ∇_x (x^T A x) = (A + A^T) x, ∇_x (x^T R) = R, ∇_x (R^T x) = R.

The Hessian is simply given by 2Q (it is constant, and doesn't depend on x). If Q is positive definite, then ∇f = 0 gives us the unique minimizer:

  2 Q x − 2 R = 0   ⟹   Q x = R   ⟹   x̂ = Q⁻¹ R
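As a numerical companion (my own sketch; the dimension and random values are illustrative), the snippet below checks the gradient of the quadratic form against finite differences and confirms that solving Qx = R gives a stationary point.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

# Symmetric positive-definite Q, vector R, scalar S (illustrative values).
A = rng.standard_normal((d, d))
Q = A @ A.T + d * np.eye(d)
R = rng.standard_normal(d)
S = 3.0

f    = lambda x: x @ Q @ x - 2 * R @ x + S     # f(x) = x^T Q x - 2 x^T R + S
grad = lambda x: 2 * Q @ x - 2 * R             # gradient from the identities above

# Finite-difference check of the gradient at a random point.
x0, eps = rng.standard_normal(d), 1e-6
num_grad = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps) for e in np.eye(d)])
print(np.max(np.abs(num_grad - grad(x0))))     # ~0

# Minimizer: solve Q x = R (never form Q^{-1} explicitly in practice).
x_hat = np.linalg.solve(Q, R)
print(np.max(np.abs(grad(x_hat))))             # ~0 at the minimizer
```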

where μ = W^T μ_c and Σ = W^T D_c W:

  X̂ = arg min_X ‖Y − X‖² + (X − μ)^T Σ⁻¹ (X − μ)
    = arg min_X X^T (I + Σ⁻¹) X − 2 X^T (Y + Σ⁻¹ μ) + ( ‖Y‖² + μ^T Σ⁻¹ μ )

Q = (I + Σ⁻¹) is positive definite (the sum of two positive-definite matrices is positive-definite), so

  X̂ = (I + Σ⁻¹)⁻¹ (Y + Σ⁻¹ μ)

  Σ⁻¹ = (W^T D_c W)⁻¹ = W⁻¹ D_c⁻¹ (W^T)⁻¹ = W^T D_c⁻¹ W      (i.e., taking the inverse in the original de-correlated wavelet domain)
  Σ⁻¹ μ = W^T D_c⁻¹ W μ = W^T D_c⁻¹ W W^T μ_c = W^T D_c⁻¹ μ_c      (i.e., the inverse wavelet transform of the scaled wavelet coefficients)

Note that we can't actually form I + Σ⁻¹, let alone invert it: it is a d × d matrix, with d = total number of pixels. But:

  I + Σ⁻¹ = I + W^T D_c⁻¹ W = W^T W + W^T D_c⁻¹ W = W^T I W + W^T D_c⁻¹ W = W^T (I + D_c⁻¹) W
  (I + Σ⁻¹)⁻¹ = W^T (I + D_c⁻¹)⁻¹ W

I + D_c⁻¹ is also diagonal, so its inverse is just inverting the diagonal elements. Therefore

  X̂ = W^T (I + D_c⁻¹)⁻¹ W (Y + W^T D_c⁻¹ μ_c)

Let's call Ĉ = W X̂ and C_y = W Y. Then

  Ĉ = W X̂ = W W^T (I + D_c⁻¹)⁻¹ (W Y + W W^T D_c⁻¹ μ_c) = (I + D_c⁻¹)⁻¹ (C_y + D_c⁻¹ μ_c)

So what we did is that we "denoised" in the wavelet domain, with the noise variance in the wavelet domain also being equal to 1: if Y = X + ε, then C_y = C + W ε, and W ε is also a random vector with 0 mean and covariance W I W^T = I.

So far, we've been in a Bayesian setting with a prior:

  X̂ = arg min_X −log p(Y|X) − log p(X)

We're balancing off fidelity to the observation (first term, the likelihood) and what we believe about X (second term, the prior). But we need p(X) to be a 'proper' probability distribution, and sometimes that's too restrictive. Instead, just think of these as a "data term" (which still comes from the observation model) and a generic "regularization" term:

  X̂ = arg min_X D(X; Y) + R(X) = arg min_X ‖Y − X‖² + λ ( ‖G_x X‖² + ‖G_y X‖² )

Here, G_x X is a vector corresponding to the image we get by convolving X with the Gaussian x-derivative filter (and similarly G_y X with the y-derivative filter).
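A numpy sketch of the wavelet-domain shortcut (not from the slides; a random orthogonal matrix stands in for W, and the sizes and prior parameters are illustrative): the diagonal update Ĉ = (I + D_c⁻¹)⁻¹ (C_y + D_c⁻¹ μ_c) is checked against a direct solve of (I + Σ⁻¹) X̂ = Y + Σ⁻¹ μ, which is only feasible here because the "image" is tiny.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64                                            # tiny "image" so the direct solve is feasible

# Stand-in for the unitary wavelet transform: a random orthogonal matrix.
W, _ = np.linalg.qr(rng.standard_normal((d, d)))
mu_c  = rng.standard_normal(d)                    # coefficient prior means
var_c = rng.uniform(0.5, 4.0, d)                  # coefficient prior variances (diagonal of D_c)

Y = rng.standard_normal(d)                        # a noisy observation (noise variance 1)

# Efficient path: denoise in the transform domain with purely diagonal operations.
C_y   = W @ Y
C_hat = (C_y + mu_c / var_c) / (1.0 + 1.0 / var_c)    # (I + D_c^{-1})^{-1} (C_y + D_c^{-1} mu_c)
X_hat = W.T @ C_hat

# Reference path: form Sigma^{-1} = W^T D_c^{-1} W explicitly and solve.
Sigma_inv = W.T @ np.diag(1.0 / var_c) @ W
mu = W.T @ mu_c
X_ref = np.linalg.solve(np.eye(d) + Sigma_inv, Y + Sigma_inv @ mu)

print(np.max(np.abs(X_hat - X_ref)))              # ~0: both compute the same MAP estimate
```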

The data term still has the same form as for denoising. But now the regularization just says that we want the x and y spatial derivatives of our image to be small: it encodes a prior that natural images are smooth.

This isn't a prior, because the gradients aren't a "representation" of X. But it can be better, because it doesn't enforce smoothness only at alternate gradient locations.

How do we minimize this?

  X̂ = arg min_X ‖Y − X‖² + λ ( ‖G_x X‖² + ‖G_y X‖² )
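The slides leave the question open; one simple option is plain gradient descent on this quadratic cost. The sketch below is my own illustration, not the lecture's method: it uses forward-difference filters with circular boundaries as a stand-in for the Gaussian derivative filters G_x, G_y, and the step size, weight, and iteration count are illustrative.

```python
import numpy as np

def dx(X):  return np.roll(X, -1, axis=1) - X      # forward difference in x (circular boundary)
def dy(X):  return np.roll(X, -1, axis=0) - X      # forward difference in y
def dxT(V): return np.roll(V,  1, axis=1) - V      # adjoint (transpose) of dx
def dyT(V): return np.roll(V,  1, axis=0) - V      # adjoint of dy

def smooth_denoise(Y, lam=2.0, step=0.02, iters=1000):
    """Minimize ||Y - X||^2 + lam * (||Gx X||^2 + ||Gy X||^2) by gradient descent,
    with Gx, Gy approximated by plain finite-difference filters."""
    X = Y.copy()
    for _ in range(iters):
        grad = 2 * (X - Y) + 2 * lam * (dxT(dx(X)) + dyT(dy(X)))
        X -= step * grad
    return X

# Illustrative usage on a random "image".
rng = np.random.default_rng(3)
Y = rng.standard_normal((64, 64))
X_hat = smooth_denoise(Y)
```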