Bayesian Paradigm. Maximum A Posteriori Estimation


Simple acquisition model: noise + degradation

Constrained minimization, or

Equivalent formulations: constrained minimization, the Lagrangian (unconstrained minimization), and Maximum A Posteriori (MAP)
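
A sketch of why these coincide (symbols assumed here, since the slide's formulas are not reproduced: H is the degradation operator, R the regularizer):

  \min_u R(u) \ \text{s.t.}\ \|Hu - z\|_2^2 \le \varepsilon
  \;\longleftrightarrow\;
  \min_u \tfrac{1}{2}\|Hu - z\|_2^2 + \lambda R(u)
  \;\longleftrightarrow\;
  \max_u p(u \mid z),

for a suitable λ, since -log p(u|z) = -log p(z|u) - log p(u) + const, with -log p(z|u) proportional to ||Hu - z||^2 for Gaussian noise and -log p(u) proportional to R(u).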

Discrete representation: the image is a random field; each pixel value is a realization of some random variable

Random variables. Discrete: takes a countable number of distinct values, e.g. the number of children in a family. Continuous: x is described by a probability density function ("distribution"). Moments: expectation ("average") and variance.

Random variables Expectation Variance
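
For reference, the standard definitions for a continuous random variable x with density p(x):

  E[x] = \int x\, p(x)\, dx, \qquad
  \operatorname{Var}[x] = E\big[(x - E[x])^2\big] = E[x^2] - E[x]^2.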

Image as a random field

Normal distribution

Multivariate Normal Distribution

Multivariate Normal Distribution
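
The densities in question, in their standard form (μ is the mean, σ² the variance, Σ the d×d covariance matrix):

  N(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big),
  \qquad
  N(\mathbf{x};\boldsymbol\mu,\Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\Big(-\tfrac12 (\mathbf{x}-\boldsymbol\mu)^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol\mu)\Big).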

Gamma distribution
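
One common shape-rate parameterization (the slide's exact convention is not shown):

  \operatorname{Gam}(x; a, b) = \frac{b^a}{\Gamma(a)}\, x^{a-1} e^{-bx}, \quad x > 0,
  \qquad E[x] = \frac{a}{b}, \quad \operatorname{Var}[x] = \frac{a}{b^2}.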

Bayesian Paradigm. The a posteriori distribution: the unknown; the likelihood: given by our problem; the a priori distribution: our prior knowledge. Maximum a posteriori (MAP): max p(u|z). Maximum likelihood (MLE): max p(z|u).
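
Bayes' rule ties the three together:

  p(u \mid z) = \frac{p(z \mid u)\, p(u)}{p(z)} \propto p(z \mid u)\, p(u),
  \qquad
  \hat u_{\mathrm{MAP}} = \arg\max_u p(z \mid u)\, p(u),
  \qquad
  \hat u_{\mathrm{MLE}} = \arg\max_u p(z \mid u).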

Likelihood

Image Prior: intensity histogram and gradient histogram (plots)

Image Prior: intensities and gradients (log-scale plots)

Image Prior: gradient histogram (plot)

Image Prior: gradient histogram with Tikhonov regularization and TV regularization fits (plots)

Image Prior Non-convex regularization
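
A sketch of the three priors being contrasted, written on the image gradient (the exponent p < 1 in the non-convex case is an assumption for illustration):

  R_{\mathrm{Tikhonov}}(u) = \|\nabla u\|_2^2, \qquad
  R_{\mathrm{TV}}(u) = \|\nabla u\|_1, \qquad
  R_{\mathrm{non\text{-}convex}}(u) = \sum_i |(\nabla u)_i|^p, \ \ 0 < p < 1.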

MAP

Kullback-Leibler divergence: measures the dissimilarity between two distributions
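
Its standard definition for densities q and p:

  D_{\mathrm{KL}}(q \,\|\, p) = \int q(x) \log\frac{q(x)}{p(x)}\, dx \;\ge\; 0,

with equality if and only if q = p (almost everywhere).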

Convex function f(x). Jensen's inequality:
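
For reference, convexity of f and Jensen's inequality read:

  f\big(\lambda x + (1-\lambda)y\big) \le \lambda f(x) + (1-\lambda) f(y), \quad 0 \le \lambda \le 1,
  \qquad
  f\big(E[x]\big) \le E\big[f(x)\big].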

Kullback-Leibler divergence is strictly convex

Variational Bayes: approximate the posterior with a simpler, factorized form and minimize the KL divergence with respect to the approximating factors. Euler-Lagrange equation:

Variational Bayes

Variational Bayes: minimize with respect to one factor, keeping all other factors fixed

Variational Bayes Minimize with respect to

Factorized approximation
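
The standard mean-field result (a sketch; q(θ) = \prod_j q_j(θ_j) approximates the posterior p(θ | z)):

  \log q_j^\star(\theta_j) = E_{q_{i \ne j}}\big[\log p(z, \theta)\big] + \text{const},

i.e. each factor is updated with the expectation of the log joint taken under all the other factors, and the updates are cycled until convergence.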

A Gaussian mixture approximated by a single Gaussian: the result depends on the initialization

Denoising with unknown noise level. Noise model: additive Gaussian noise whose level is unknown. Bayes: the posterior is over both the image and the noise parameter, which is unknown.
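
A sketch of the setting under the usual assumptions (additive white Gaussian noise with unknown precision γ; the symbol names are assumptions, the slide does not spell them out):

  z = u + n, \quad n \sim N(0, \gamma^{-1} I),
  \qquad
  p(u, \gamma \mid z) \propto p(z \mid u, \gamma)\, p(u)\, p(\gamma).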

MAP denoising: minimize the -log posterior, alternately with respect to the image and with respect to the noise parameter:
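
As an illustration of the noise step (assuming N pixels and, for simplicity, a flat prior on γ):

  -\log p(z \mid u, \gamma) = \frac{\gamma}{2}\|z - u\|^2 - \frac{N}{2}\log\gamma + \text{const}
  \quad\Rightarrow\quad
  \hat\gamma = \frac{N}{\|z - u\|^2},

while the image step, with γ fixed, is an ordinary regularized denoising problem.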

A Posteriori surface

MAP denoising

Marginalized denoising: the expectation of the marginalized posterior
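
That is, instead of the posterior mode one takes (a sketch):

  \hat u = E[u \mid z] = \int u\, p(u \mid z)\, du,
  \qquad
  p(u \mid z) = \int p(u, \gamma \mid z)\, d\gamma,

with the nuisance noise parameter integrated out rather than jointly maximized.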

Marginalized denoising

VB denoising: find the factorized approximation of the posterior. Update for the image factor:

VB denoising: update for the noise precision factor:

VB denoising

Comparison

Sparse & Redundant Representation: Iterated-Shrinkage Algorithms, Wavelet Thresholding

We Will Talk About the minimization of the functional f(a) = 1/2 ||x - Da||^2 + λ ρ(a). Substitution: u = Da (assuming D is invertible) links this to our original denoising functional; for a non-invertible D, the Moore-Penrose pseudoinverse takes the place of the inverse.

A Closer Look at the Functional. D is a rectangular, overcomplete dictionary -> an underdetermined system; a is a sparse representation; ρ(a) is a measure of sparsity.

Measures of Sparsity: l_p norms; the l_0 norm counts nonzero elements; many other sparsity measures exist, e.g. a smoothed l_1.
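
Written out, the measures referred to above are

  \|a\|_p^p = \sum_i |a_i|^p \ (p > 0),
  \qquad
  \|a\|_0 = \#\{ i : a_i \ne 0 \},

and smooth approximations of the l_1 norm, such as \sum_i \sqrt{a_i^2 + \epsilon}, are also used (the specific smoothing shown here is an assumption).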

l_2 unit ball

l_1 unit ball

l_0.9 unit ball

l_0.5 unit ball

l_0.3 unit ball

Equivalent formulation: method of Lagrange multipliers

l_2-norm: solution via the pseudoinverse is simple but not sparse:
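
For the constrained form min ||a||_2 s.t. Da = x and a full-row-rank D, the minimizer is the standard pseudoinverse solution

  \hat a = D^+ x = D^\top (D D^\top)^{-1} x,

which is generally dense (all entries nonzero) rather than sparse.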

l_1-norm

Why Seek a Sparse Representation? Assume x is a 1-sparse signal in the dictionary; then that single-atom representation is the perfect solution. Overcomplete dictionaries => sparsity.

Classical Solvers. Steepest descent; the Hessian of the quadratic data term is D^T D. Convergence depends on the condition number κ of the Hessian; an ill-conditioned Hessian => poor convergence.
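
A sketch of why (gradient step on the quadratic data term, whose Hessian is D^T D):

  a_{k+1} = a_k + \mu\, D^\top (x - D a_k),
  \qquad
  \kappa = \frac{\lambda_{\max}(D^\top D)}{\lambda_{\min}(D^\top D)};

the error contracts roughly by a factor (κ - 1)/(κ + 1) per iteration, and for correlated or overcomplete dictionaries κ is large (D^T D may even be singular), hence the slow convergence.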

The Unitary Case (DD^T = I). Use the unitarity of D and define b = D^T x; since the l_2 norm is unitarily invariant, we obtain m independent 1D optimization problems.
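
A sketch of the separation, assuming D is square and unitary (so D^T D = D D^T = I) and ρ acts entrywise:

  f(a) = \tfrac12\|x - Da\|_2^2 + \lambda\rho(a)
       = \tfrac12\|D^\top x - a\|_2^2 + \lambda\rho(a)
       = \sum_{j=1}^{m} \Big[ \tfrac12 (b_j - a_j)^2 + \lambda\rho(a_j) \Big],
  \qquad b = D^\top x,

so each coefficient can be optimized on its own.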

The 1D Task. We need to solve the following 1D problem: for a scalar b, find a_opt(b) = argmin_a 1/2 (b - a)^2 + λ ρ(a); the mapping b -> a_opt can be stored as a Look-Up-Table (LUT). Such a LUT can be built for ANY sparsity measure function ρ(a), including non-convex ones and non-smooth ones (e.g., the l_0 norm).

Shrinkage for ρ(a) = |a|^p

Soft & Hard Thresholding Soft thresholding Hard thresholding
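
A minimal sketch of the two rules in NumPy (the threshold t plays the role of λ, up to the scaling implied by the particular ρ):

import numpy as np

def soft_threshold(b, t):
    # Shrinkage rule associated with rho(a) = |a| (the l1 penalty).
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

def hard_threshold(b, t):
    # Keep entries with magnitude above t, zero out the rest (l0-type penalty).
    return np.where(np.abs(b) > t, b, 0.0)

# small numeric check
b = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])
print(soft_threshold(b, 0.5))
print(hard_threshold(b, 0.5))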

The Unitary Case: Summary. Minimizing f(a) is done by: multiply x by D^H to get b; apply the shrinkage LUT S_{ρ,λ} to b to obtain â; multiply â by D to get x̂. DONE! The obtained solution is the GLOBAL minimizer of f(a), even if f(a) is non-convex.
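
A minimal end-to-end sketch of this pipeline in NumPy, with a random orthogonal D standing in for a unitary dictionary and ρ = |·| (so the LUT is soft thresholding); all names here are illustrative:

import numpy as np

rng = np.random.default_rng(0)
m = 64
D, _ = np.linalg.qr(rng.standard_normal((m, m)))          # unitary (orthogonal) dictionary
lam = 0.2

a_true = rng.standard_normal(m) * (rng.random(m) < 0.1)   # sparse coefficients
x = D @ a_true + 0.05 * rng.standard_normal(m)            # noisy observed signal

b = D.T @ x                                               # multiply by D^H (here real, so D^T)
a_hat = np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)     # LUT: soft shrinkage S_{rho,lambda}
x_hat = D @ a_hat                                         # multiply by D -> the estimate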

The minimization of f(a) leads to two seemingly contradicting observations: 1. The problem is quite hard; classic optimization methods find it hard. 2. The problem is trivial for the case of a unitary D. How can we enjoy this simplicity in the general case?

Iterated-Shrinkage Algorithms. Common to all is a set of operations in every iteration that includes: (i) multiplication by D, (ii) multiplication by D^T, and (iii) a scalar shrinkage on the solution, S_{ρ,λ}(a).

The Proximal-Point Method. Aim: minimize f(a). Suppose it is found to be too hard. Define a surrogate function g(a, a_0) = f(a) + dist(a - a_0). Then the following algorithm necessarily converges to a local minimum of f(a) [Rockafellar, 1976]: a_0 -> minimize g(a, a_0) -> a_1 -> minimize g(a, a_1) -> a_2 -> ... -> a_k -> minimize g(a, a_k) -> a_{k+1}. Comments: (i) the minimization of g(a, a_0) should be easier; (ii) it looks like it will slow down convergence. Really?

The Proposed Surrogate Function. Original function: f(a) = 1/2 ||x - Da||^2 + λ ρ(a). Distance: a quadratic proximity term chosen so that the troublesome term 1/2 ||Da||^2 vanishes; the beauty is that the remaining coupling is not a function of a. Minimization of g(a, a_0) is then done in closed form by shrinkage, applied to the vector b_k, and this generates the solution a_{k+1} of the next iteration. We again have m independent 1D optimization problems.
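
A sketch of the standard SSF construction (assuming c exceeds the largest eigenvalue of D^T D):

  g(a, a_0) = \tfrac12\|x - Da\|_2^2 + \lambda\rho(a)
            + \tfrac{c}{2}\|a - a_0\|_2^2 - \tfrac12\|Da - Da_0\|_2^2 .

Expanding, the \tfrac12\|Da\|_2^2 terms cancel, and up to terms independent of a,

  g(a, a_0) = \tfrac{c}{2}\|a - b_0\|_2^2 + \lambda\rho(a) + \text{const},
  \qquad
  b_0 = a_0 + \tfrac{1}{c} D^\top (x - D a_0),

which is separable and is minimized entrywise by the shrinkage a_1 = S_{\rho, \lambda/c}(b_0).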

Iterated-Shrinkage Algorithm. While the unitary-case solution is given by a single pass (x -> multiply by D^T -> b -> LUT S_{ρ,λ} -> â -> multiply by D -> x̂), the general case, handled by SSF, requires iterating: form the residual x - Da_k, multiply it by D^T/c and add a_k to get b_k, apply the LUT S_{ρ,λ/c} to obtain a_{k+1}, and finally multiply by D to get x̂.
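
A minimal sketch of this iteration for ρ(a) = |a| (ISTA-style soft thresholding); the factor 1.01 in the choice of c is an assumption made only to satisfy c > ||D||_2^2:

import numpy as np

def ssf_l1(D, x, lam, n_iter=200):
    # Iterated shrinkage (SSF) for f(a) = 0.5*||x - D a||^2 + lam*||a||_1.
    c = 1.01 * np.linalg.norm(D, 2) ** 2        # c must exceed the largest eigenvalue of D^T D
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        b = a + D.T @ (x - D @ a) / c           # residual, multiplied by D^T / c, added to a_k
        a = np.sign(b) * np.maximum(np.abs(b) - lam / c, 0.0)   # LUT: S_{rho, lam/c}
    return a

# usage: a_hat = ssf_l1(D, x, lam=0.1); x_hat = D @ a_hat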

Linear Programming. Basis Pursuit: min ||a||_1 s.t. Da = x. LP: interior-point algorithms for LP problems.
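
A sketch of the reformulation, splitting a into nonnegative parts a = a^+ - a^-:

  \min_a \|a\|_1 \ \text{s.t.}\ Da = x
  \quad\Longleftrightarrow\quad
  \min_{a^+, a^- \ge 0} \mathbf{1}^\top (a^+ + a^-) \ \text{s.t.}\ D(a^+ - a^-) = x,

which is a linear program and can therefore be handled by interior-point (or simplex) methods.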

Compressed Sensing. In compressed sensing we compress the signal x of size m by exploiting its origin. This is done by p << m random projections y = Px (P of size p x m). The core idea: y holds all the information about the original signal x, even though p << m. Reconstruction? Done by MAP, i.e. solving for a sparse a with y ≈ PDa = Ma and setting x̂ = Dâ; noise is handled by the same formulation.
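
A sketch of the recovery problem with M = PD (the noisy case simply relaxes the equality to a tolerance ε):

  y = Px = PDa = Ma,
  \qquad
  \hat a = \arg\min_a \rho(a) \ \text{s.t.}\ \|y - Ma\|_2 \le \varepsilon,
  \qquad
  \hat x = D\hat a.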