Optimal Value Function Methods in Numerical Optimization: Level Set Methods


Optimal Value Function Methods in Numerical Optimization: Level Set Methods
James V. Burke, Mathematics, University of Washington (jvburke@uw.edu)
Joint work with Aravkin (UW), Drusvyatskiy (UW), Friedlander (UBC/Davis), Roy (UW)
2016 Joint Mathematics Meeting, SIAM Minisymposium on Optimization

Motivation: Optimization in Large-Scale Inference
A range of large-scale data-science applications can be modeled using optimization:
- Inverse problems (medical and seismic imaging)
- High-dimensional inference (compressive sensing, LASSO, quantile regression)
- Machine learning (classification, matrix completion, robust PCA, time series)
These applications often exploit side information:
- Sparsity or low rank of the solution
- Constraints (topography, non-negativity)
- Regularization (priors, total variation, dirty data)
We need efficient large-scale solvers for such nonsmooth programs.

The Prototypical Problem
Sparse data fitting: find a sparse x with Ax \approx b (e.g. model selection, compressed sensing).
Convex approaches:
- BPDN:       \min_x \|x\|_1  s.t.  \tfrac{1}{2}\|Ax - b\|_2^2 \le \sigma
- LASSO:      \min_x \tfrac{1}{2}\|Ax - b\|_2^2  s.t.  \|x\|_1 \le \tau
- Lagrangian: \min_x \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda \|x\|_1
BPDN: often the most natural and transparent formulation (e.g. physical considerations guide \sigma).
Lagrangian: ubiquitous in practice (e.g. no constraints).
All three are (essentially) equivalent computationally! This is the basis for SPGL1 (van den Berg-Friedlander '08).
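To make the three formulations concrete, here is a minimal sketch (not part of the talk) that states them in CVXPY. The data A, b and the parameter values sigma, tau, lam are synthetic placeholders; the three problems coincide only for correspondingly chosen parameter values.

```python
# Sketch: the three convex formulations of sparse data fitting, stated in CVXPY.
# Synthetic data; sigma, tau, lam are placeholder values (the formulations are
# equivalent only for corresponding choices of these parameters).
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
b = rng.standard_normal(40)
sigma, tau, lam = 0.5, 3.0, 0.1

x = cp.Variable(100)
misfit = 0.5 * cp.sum_squares(A @ x - b)

bpdn       = cp.Problem(cp.Minimize(cp.norm1(x)), [misfit <= sigma])
lasso      = cp.Problem(cp.Minimize(misfit), [cp.norm1(x) <= tau])
lagrangian = cp.Problem(cp.Minimize(misfit + lam * cp.norm1(x)))

for name, prob in [("BPDN", bpdn), ("LASSO", lasso), ("Lagrangian", lagrangian)]:
    prob.solve()
    print(name, prob.value)
```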

Optimal Value or Level Set Framework
Problem class: solve
    P(\sigma):  \min_{x \in X} \phi(x)  s.t.  \rho(Ax - b) \le \sigma.
Strategy: consider the flipped problem
    Q(\tau):  v(\tau) := \min_{x \in X} \rho(Ax - b)  s.t.  \phi(x) \le \tau.
Then opt-val(P(\sigma)) is the minimal root of the equation v(\tau) = \sigma.
Assume X, \rho, \phi are convex; then v is convex.
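For the BPDN instance (\rho = \tfrac12\|\cdot\|_2^2, \phi = \|\cdot\|_1, X = R^n), the value function v can be evaluated by solving the flipped (LASSO) subproblem. Below is a small illustrative sketch, again in CVXPY with synthetic data, that tabulates v on a grid of \tau values and exhibits the convex, nonincreasing graph whose root against \sigma is sought.

```python
# Sketch: evaluating the level-set value function v(tau) for the BPDN instance,
#   v(tau) = min_x 0.5*||Ax - b||_2^2  s.t.  ||x||_1 <= tau   (the flipped problem Q(tau)).
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((40, 100))
b = rng.standard_normal(40)

def value_function(tau):
    x = cp.Variable(A.shape[1])
    prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(A @ x - b)),
                      [cp.norm1(x) <= tau])
    prob.solve()
    return prob.value

# v is convex and nonincreasing; opt-val(P(sigma)) is its minimal root against the level sigma.
for tau in [0.0, 1.0, 2.0, 4.0, 8.0]:
    print(f"v({tau}) = {value_function(tau):.4f}")
```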

Inexact Root Finding
Problem: find a root of the inexactly known convex function v(\cdot) - \sigma.
Bisection is one approach, but it has
- nonmonotone iterates (bad for warm starts), and
- at best linear convergence (even with perfect information).
Solution: modified secant and approximate Newton methods.
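The following is a sketch of the Newton iteration on v(\tau) = \sigma for the BPDN instance, reusing a LASSO oracle as above. The slope comes from the Lagrange multiplier of the \ell_1 constraint, which for this instance is \|A^T r\|_\infty at a solution with residual r = b - Ax_\tau (this parallels the SPGL1 derivative formula). The oracle here is exact and the helper names and tolerance are my own choices; in the level-set framework the subproblems may be solved only approximately, which the method tolerates.

```python
# Sketch: Newton root finding on v(tau) = sigma for the BPDN instance.
# Slope of the value function: v'(tau) = -||A^T r||_inf with r = b - A x_tau at a LASSO solution.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((40, 100))
b = rng.standard_normal(40)
sigma = 1.0                                   # placeholder misfit level

def lasso_oracle(tau):
    """Return v(tau) and a slope estimate obtained from the dual of the l1 constraint."""
    x = cp.Variable(A.shape[1])
    prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(A @ x - b)),
                      [cp.norm1(x) <= tau])
    prob.solve()
    r = b - A @ x.value
    return prob.value, -np.linalg.norm(A.T @ r, np.inf)

tau = 0.0
for k in range(20):
    v, slope = lasso_oracle(tau)
    if v <= sigma + 1e-6:                     # reached the root (to tolerance) from the left
        break
    tau += (sigma - v) / slope                # Newton step; iterates increase monotonically
print("iterations:", k, " tau:", tau, " v(tau):", v)
```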

Inexact Root Finding: Secant
[Figure-only slides illustrating the inexact secant iteration on the value function.]

Inexact Root Finding: Newton
[Figure-only slides illustrating the inexact Newton iteration on the value function.]

Inexact Root Finding: Convergence
[Figure-only slides.]
Key observation: the constant C = C(\tau_0) is independent of v'(\tau^*).

Minorants from Duality
[Figure-only slides illustrating minorants of v obtained from duality.]

Robustness: 1 \le u/l \le \alpha, where \alpha \in [1, 2)
[Figure: inexact secant (top row) and Newton (bottom row) iterations for f_1(\tau) = (\tau - 1)^2 - 10 (first two columns) and f_2(\tau) = \tau^2 (last column). For each panel, \alpha is the oracle accuracy and k is the number of iterations needed to converge, i.e., to reach f_i(\tau_k) \le \epsilon = 10^{-2}. Panels: (a) k = 13, \alpha = 1.3; (b) k = 770, \alpha = 1.99; (c) k = 18, \alpha = 1.3; (d) k = 9, \alpha = 1.3; (e) k = 15, \alpha = 1.99; (f) k = 10, \alpha = 1.3.]

Sensor Network Localization (SNL)
[Figure: sensor locations in the plane.]
Given a weighted graph G = (V, E, d), find a realization p_1, ..., p_n \in R^2 with d_{ij} = \|p_i - p_j\|_2 for all ij \in E.

Sensor Network Localization (SNL)
SDP relaxation (Weinberger et al. '04, Biswas et al. '06):
    \max \operatorname{tr}(X)  s.t.  \|P_E \mathcal{K}(X) - d^2\|_2^2 \le \sigma,  Xe = 0,  X \succeq 0,
where [\mathcal{K}(X)]_{ij} = X_{ii} + X_{jj} - 2 X_{ij}.
Intuition: if X = P P^T with p_i the ith row of P, then \operatorname{tr}(X) = \frac{1}{n+1} \sum_{i,j=1}^{n} \|p_i - p_j\|^2.
Flipped problem:
    \min \|P_E \mathcal{K}(X) - d^2\|_2^2  s.t.  \operatorname{tr} X = \tau,  Xe = 0,  X \succeq 0.
This is perfectly adapted to the Frank-Wolfe method (see the sketch below).
Key point: failure of Slater's condition (which is always the case here) is irrelevant.
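The Frank-Wolfe subproblem over the feasible set {X \succeq 0 : \operatorname{tr} X = \tau, Xe = 0} is a minimum-eigenvector computation: the linear-minimization oracle returns \tau u u^T with u a unit minimum eigenvector of the gradient projected onto the subspace orthogonal to e. Below is a small sketch of one such step in dense NumPy; the function name, step-size rule, and the spectral shift used to exclude the e direction are my own choices, not from the talk.

```python
# Sketch: one Frank-Wolfe step for the flipped SNL problem
#   min_X f(X)  s.t.  tr X = tau,  Xe = 0,  X positive semidefinite.
import numpy as np

def frank_wolfe_step(X, grad, tau, k):
    """Move X toward the vertex tau * u u^T returned by the linear-minimization oracle."""
    n = X.shape[0]
    e = np.ones(n) / np.sqrt(n)
    C = np.eye(n) - np.outer(e, e)              # projector enforcing the centering constraint Xe = 0
    G = C @ grad @ C                            # gradient restricted to centered matrices
    G = G + np.abs(G).sum() * np.outer(e, e)    # shift the e direction up so it is never selected
    _, U = np.linalg.eigh(G)                    # eigenvalues in ascending order
    u = U[:, 0]                                 # minimum eigenvector (orthogonal to e)
    S = tau * np.outer(u, u)                    # extreme point of the feasible set
    eta = 2.0 / (k + 2.0)                       # standard Frank-Wolfe step size
    return X + eta * (S - X)
```

Each iteration thus costs one extreme eigenpair plus a rank-one update, which is what makes the flipped problem attractive at scale.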

Approximate Newton
[Figures: Pareto curve with approximate Newton iterates, for \sigma = 0.25 (left) and \sigma = 0 (right).]

Max-trace
[Figures: localizations from the max-trace relaxation. Top row: Newton_max solution and refined positions. Bottom row: Newton_min solution and refined positions.]

Conclusion
- Simple strategy for optimizing over complex domains
- Rigorous convergence guarantees
- Insensitivity to ill-conditioning
- Many applications:
  - Sensor network localization (Drusvyatskiy-Krislock-Voronin-Wolkowicz '15)
  - Large-scale SDP and LP (cf. Renegar '14)
  - Chromosome reconstruction (Aravkin-Becker-Drusvyatskiy-Lozano '15)
  - Phase retrieval (Aravkin-Burke-Drusvyatskiy-Friedlander-Roy '15)
  - ...
References:
- Optimal value function methods for numerical optimization, Aravkin-Burke-Drusvyatskiy-Friedlander-Roy, preprint, 2015.
- Noisy Euclidean distance realization: facial reduction and the Pareto frontier, Cheung-Drusvyatskiy-Krislock-Wolkowicz, submitted, 2015.

More References
- Probing the Pareto frontier for basis pursuit solutions, van den Berg-Friedlander, SIAM J. Sci. Comput. 31 (2008), 890-912.
- Sparse optimization with least-squares constraints, van den Berg-Friedlander, SIOPT 21 (2011), 1201-1229.
- Variational properties of value functions, Aravkin-Burke-Friedlander, SIOPT 23 (2013), 1689-1717.

Basis Pursuit with Outliers
    BP_\sigma:  \min \|x\|_1  s.t.  \rho(b - Ax) \le \sigma.
Standard least squares: \rho(z) = \|z\|_2 or \rho(z) = \|z\|_2^2.
Quantile Huber:
    \rho_{\kappa,\tau}(r) = \begin{cases} \tau |r| - \frac{\kappa \tau^2}{2}, & r < -\tau\kappa, \\ \frac{1}{2\kappa} r^2, & r \in [-\kappa\tau, (1-\tau)\kappa], \\ (1-\tau)|r| - \frac{\kappa(1-\tau)^2}{2}, & r > (1-\tau)\kappa. \end{cases}
This reduces to the standard Huber when \tau = 0.5.
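The following is a direct NumPy transcription of the quantile Huber penalty above (vectorized; kappa is the smoothing threshold, tau the quantile level). Setting tau = 0.5 gives the symmetric, standard-Huber shape.

```python
# Sketch: the quantile Huber penalty rho_{kappa,tau}, vectorized with NumPy.
import numpy as np

def quantile_huber(r, kappa=1.0, tau=0.5):
    r = np.asarray(r, dtype=float)
    left  = r < -tau * kappa                    # linear branch for large negative residuals
    right = r > (1.0 - tau) * kappa             # linear branch for large positive residuals
    mid   = ~(left | right)                     # quadratic branch near zero
    out = np.empty_like(r)
    out[left]  = -tau * r[left] - kappa * tau**2 / 2.0
    out[mid]   = r[mid]**2 / (2.0 * kappa)
    out[right] = (1.0 - tau) * r[right] - kappa * (1.0 - tau)**2 / 2.0
    return out

print(quantile_huber([-3.0, 0.2, 3.0], kappa=1.0, tau=0.3))
```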

Huber
[Figure-only slide: plot of the Huber penalty.]

Sparse and Robust Formulation
    HBP_\sigma:  \min \|x\|_1  s.t.  \rho(b - Ax) \le \sigma.
Problem specification:
- x: 20-sparse spike train in R^{512}
- b: measurements in R^{120}, with 5 outliers
- A: measurement matrix satisfying RIP
- \rho: Huber function
- \sigma: error level, set at 0.01
Results: in the presence of outliers, the robust (Huber) formulation recovers the spike train, while the standard least-squares formulation does not.
[Figures: recovered signals (Truth, Huber, LS) and residuals.]

Sparse and Robust Formulation (nonnegative variant)
    HBP_\sigma:  \min_{x \ge 0} \|x\|_1  s.t.  \rho(b - Ax) \le \sigma.
Problem specification:
- x: 20-sparse spike train in R^{512}_{+}
- b: measurements in R^{120}, with 5 outliers
- A: measurement matrix satisfying RIP
- \rho: Huber function
- \sigma: error level, set at 0.01
Results: in the presence of outliers, the robust (Huber) formulation recovers the spike train, while the standard least-squares formulation does not.
[Figures: recovered signals (Truth, Huber, LS) and residuals.]
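To illustrate the robust formulation itself (not the exact experiment above), here is a hedged CVXPY sketch using its built-in symmetric Huber atom cp.huber, i.e. the \tau = 0.5 case up to scaling. The data are synthetic, the dimensions are smaller than in the talk's experiment, and \sigma is chosen from the true signal's Huber misfit purely for illustration.

```python
# Sketch: the robust formulation HBP_sigma with a symmetric Huber misfit (CVXPY).
# Synthetic, small-scale stand-in for the spike-train experiment; not the talk's exact setup.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(3)
n, m, k = 128, 60, 5
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
b = A @ x_true + 0.01 * rng.standard_normal(m)
b[:3] += 10.0                                   # a few gross outliers

def huber_np(r, M=1.0):
    """Same convention as cvxpy.huber: r^2 for |r| <= M, 2*M*|r| - M^2 otherwise."""
    a = np.abs(r)
    return np.where(a <= M, a**2, 2.0 * M * a - M**2)

sigma = 1.05 * huber_np(b - A @ x_true).sum()   # oracle misfit budget, for illustration only

x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.norm1(x)),
                  [cp.sum(cp.huber(b - A @ x, M=1.0)) <= sigma])
prob.solve()
print("relative recovery error:", np.linalg.norm(x.value - x_true) / np.linalg.norm(x_true))
```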