Optimal Value Function Methods in Numerical Optimization: Level Set Methods

James V. Burke, Mathematics, University of Washington (jvburke@uw.edu)
Joint work with Aravkin (UW), Drusvyatskiy (UW), Friedlander (UBC/Davis), Roy (UW)

2016 Joint Mathematics Meetings, SIAM Minisymposium on Optimization
Motivation: Optimization in Large-Scale Inference

A range of large-scale data science applications can be modeled using optimization:
- Inverse problems (medical and seismic imaging)
- High-dimensional inference (compressive sensing, LASSO, quantile regression)
- Machine learning (classification, matrix completion, robust PCA, time series)

These applications are often solved using side information:
- Sparsity or low rank of the solution
- Constraints (topography, non-negativity)
- Regularization (priors, total variation, dirty data)

We need efficient large-scale solvers for nonsmooth programs.
The Prototypical Problem

Sparse data fitting: find a sparse x with Ax \approx b (e.g., model selection, compressed sensing).

Convex approaches:

  BPDN:        \min_x \|x\|_1  \text{ s.t. }  \tfrac{1}{2}\|Ax - b\|_2^2 \le \sigma
  LASSO:       \min_x \tfrac{1}{2}\|Ax - b\|_2^2  \text{ s.t. }  \|x\|_1 \le \tau
  Lagrangian:  \min_x \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda \|x\|_1

- BPDN: often the most natural and transparent formulation (e.g., physical considerations guide σ).
- Lagrangian: ubiquitous in practice (e.g., no constraints).

All three are (essentially) equivalent computationally! This is the basis for SPGL1 (van den Berg-Friedlander '08).
Optimal Value or Level Set Framework

Problem class: solve

  P(σ):  \min_{x \in X} \varphi(x)  \text{ s.t. }  \rho(Ax - b) \le \sigma.

Strategy: consider the flipped problem

  Q(τ):  v(\tau) := \min_{x \in X} \rho(Ax - b)  \text{ s.t. }  \varphi(x) \le \tau.

Then opt-val(P(σ)) is the minimal root of the equation v(τ) = σ.

Assume X, ρ, φ are convex ⟹ v is convex.
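As a concrete illustration, here is a minimal sketch of the value function oracle v(τ) for the LASSO instance of Q(τ); this is not code from the talk, and it assumes the cvxpy modeling package (the name lasso_value is ours).

import cvxpy as cp
import numpy as np

def lasso_value(A, b, tau):
    """Evaluate v(tau) = min_x 0.5*||Ax - b||_2^2  s.t.  ||x||_1 <= tau."""
    x = cp.Variable(A.shape[1])
    prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(A @ x - b)),
                      [cp.norm1(x) <= tau])
    prob.solve()
    return prob.value, x.value

# v is convex and nonincreasing in tau, which is what makes root finding viable:
A, b = np.random.randn(20, 50), np.random.randn(20)
for tau in [0.0, 0.5, 1.0, 2.0]:
    val, _ = lasso_value(A, b, tau)
    print(f"v({tau:.1f}) = {val:.4f}")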
Inexact Root Finding

Problem: find the root of the inexactly known convex function v(·) − σ.

Bisection is one approach, but:
- nonmonotone iterates (bad for warm starts)
- at best linear convergence (even with perfect information)

Solution: modified secant and approximate Newton methods.
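For reference (a standard statement, not reproduced from the slides), the two exact updates for the root equation v(τ) = σ are

  \tau_{k+1} = \tau_k - \frac{v(\tau_k) - \sigma}{v(\tau_k) - v(\tau_{k-1})}\,(\tau_k - \tau_{k-1}) \quad \text{(secant)},
  \qquad
  \tau_{k+1} = \tau_k - \frac{v(\tau_k) - \sigma}{v'(\tau_k)} \quad \text{(Newton)},

where v'(τ_k) is available in practice from a dual solution of Q(τ_k). The modified versions replace v(τ_k) and v'(τ_k) with upper and lower estimates produced by an inexact solve.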
Inexact Root Finding: Secant

[Figure sequence illustrating the inexact secant iteration.]
Inexact Root Finding: Newton

[Figure sequence illustrating the inexact Newton iteration.]
Inexact Root Finding: Convergence

[Figures illustrating the convergence analysis.]

Key observation: C = C(τ₀) is independent of v′(τ*).
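To make the iteration concrete, here is a hedged sketch of the approximate Newton loop; the oracle interface, names, and stopping rule are ours, not the authors':

import numpy as np

def level_set_newton(oracle, sigma, tau0=0.0, tol=1e-6, max_iter=100):
    """Find the minimal root of v(tau) = sigma by approximate Newton.

    oracle(tau) must return (u, s): an (approximate) value u >= v(tau)
    and a negative slope s taken from a dual solution of Q(tau).
    """
    tau = tau0
    for _ in range(max_iter):
        u, s = oracle(tau)
        if u - sigma <= tol:            # (approximately) feasible for P(sigma)
            return tau
        tau = tau - (u - sigma) / s     # s < 0, so the iterates increase monotonically
    return tau

For the LASSO oracle sketched earlier, the slope at τ is s = -\|A^T r\|_\infty with r = b - A x_\tau the optimal residual, so each Newton step costs one (inexact) solve of Q(τ).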
Minorants from Duality

[Figures illustrating affine minorants of v obtained from dual solutions.]
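A sketch of the idea (our paraphrase of the standard duality argument): solving Q(τ) produces, alongside the value, an affine minorant of v. If λ ≥ 0 is an optimal multiplier for the constraint φ(x) ≤ τ, then

  v(\tau') \;\ge\; v(\tau) - \lambda\,(\tau' - \tau) \qquad \text{for all } \tau',

i.e., -\lambda \in \partial v(\tau). Each (inexact) solve therefore supplies an upper bound u on v(τ) together with an affine lower model, and the Newton step moves to the root of the lower model.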
Robustness: 1 \le u/l \le \alpha, where \alpha \in [1, 2)

[Figure: inexact secant (top) and Newton (bottom) for f_1(\tau) = (\tau - 1)^2 - 10 (first two columns) and f_2(\tau) = \tau^2 (last column). Panels: (a) k = 13, α = 1.3; (b) k = 770, α = 1.99; (c) k = 18, α = 1.3; (d) k = 9, α = 1.3; (e) k = 15, α = 1.99; (f) k = 10, α = 1.3. Below each panel, α is the oracle accuracy and k is the number of iterations needed to converge, i.e., to reach f_i(\tau_k) \le \epsilon = 10^{-2}.]
Sensor Network Localization (SNL)

[Figure: a realization of a sensor network in the plane.]

Given a weighted graph G = (V, E, d), find a realization p_1, \ldots, p_n \in \mathbb{R}^2 with d_{ij} = \|p_i - p_j\|_2 for all ij \in E.
Sensor Network Localization (SNL)

SDP relaxation (Weinberger et al. '04, Biswas et al. '06):

  \max \ \mathrm{tr}(X)  \text{ s.t. }  \|P_E \mathcal{K}(X) - d\|_2^2 \le \sigma,\ \ Xe = 0,\ \ X \succeq 0,

where [\mathcal{K}(X)]_{ij} = X_{ii} + X_{jj} - 2X_{ij}.

Intuition: X = PP^T, with p_i the ith row of P, and then

  \mathrm{tr}(X) = \frac{1}{n+1} \sum_{i,j=1}^{n} \|p_i - p_j\|^2.

Flipped problem:

  \min \ \|P_E \mathcal{K}(X) - d\|_2^2  \text{ s.t. }  \mathrm{tr}(X) = \tau,\ \ Xe = 0,\ \ X \succeq 0.

This is perfectly adapted to the Frank-Wolfe method (see the sketch below).

Key point: the failure of Slater's condition (always the case here) is irrelevant.
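Below is a minimal sketch of one Frank-Wolfe iteration for the flipped problem; the helper names are ours, and the gradient of f(X) = \|P_E \mathcal{K}(X) - d\|_2^2 is assumed to be supplied by the caller. The linear minimization over \{X \succeq 0,\ \mathrm{tr}\,X = \tau,\ Xe = 0\} reduces to an extreme eigenvector computation.

import numpy as np
from scipy.sparse.linalg import eigsh

def K(X):
    """Squared-distance operator: [K(X)]_{ij} = X_ii + X_jj - 2 X_ij."""
    dgX = np.diag(X)
    return dgX[:, None] + dgX[None, :] - 2 * X

def frank_wolfe_step(X, grad_f, tau, k):
    """One Frank-Wolfe step on {X psd, tr X = tau, Xe = 0}."""
    n = X.shape[0]
    G = grad_f(X)                        # symmetric gradient, supplied by caller
    P = np.eye(n) - np.ones((n, n)) / n  # projector onto the subspace e-perp
    H = P @ G @ P                        # restrict so the atom satisfies Se = 0
    # Linear minimization oracle: the minimizing atom is tau * v v^T,
    # with v a unit eigenvector for the smallest eigenvalue of H.
    _, V = eigsh(H, k=1, which='SA')
    v = V[:, 0]
    S = tau * np.outer(v, v)
    gamma = 2.0 / (k + 2.0)              # standard open-loop step size
    return X + gamma * (S - X)

Each step needs only one extreme eigenpair, which is what makes Frank-Wolfe attractive at this scale.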
Approximate Newton

[Figures: Pareto curves v(τ) for the SNL problem, with σ = 0.25 (left) and σ = 0 (right).]
Max-trace

[Figures: recovered sensor positions. Top row: Newton_max and refined positions; bottom row: Newton_min and refined positions.]
Conclusion

- Simple strategy for optimizing over complex domains
- Rigorous convergence guarantees
- Insensitivity to ill-conditioning

Many applications:
- Sensor network localization (D-Krislock-Voronin-Wolkowicz '15)
- Large-scale SDP and LP (cf. Renegar '14)
- Chromosome reconstruction (Aravkin-Becker-D-Lozano '15)
- Phase retrieval (Aravkin-Burke-D-Friedlander-Roy '15)
- ...

References:
- Optimal value function methods for numerical optimization, Aravkin-B-Drusvyatskiy-Friedlander-Roy, preprint, 2015.
- Noisy Euclidean distance realization: facial reduction and the Pareto frontier, Cheung-Drusvyatskiy-Krislock-Wolkowicz, submitted, 2015.
More References

- Probing the Pareto frontier for basis pursuit solutions, van den Berg-Friedlander, SIAM J. Sci. Comput. 31 (2008), 890-912.
- Sparse optimization with least-squares constraints, van den Berg-Friedlander, SIOPT 21 (2011), 1201-1229.
- Variational properties of value functions, Aravkin-B-Friedlander, SIOPT 23 (2013), 1689-1717.
Basis Pursuit with Outliers

  BP_\sigma:  \min_x \|x\|_1  \text{ s.t. }  \rho(b - Ax) \le \sigma

Standard least squares: \rho(z) = \|z\|_2 or \rho(z) = \|z\|_2^2.

Quantile Huber:

  \rho_{\kappa,\tau}(r) =
  \begin{cases}
    \tau |r| - \dfrac{\kappa \tau^2}{2}, & r < -\tau\kappa,\\
    \dfrac{1}{2\kappa}\, r^2, & r \in [-\tau\kappa,\ (1-\tau)\kappa],\\
    (1-\tau)|r| - \dfrac{\kappa(1-\tau)^2}{2}, & r > (1-\tau)\kappa.
  \end{cases}

This reduces to the standard Huber function when τ = 0.5.
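A direct numpy transcription of the quantile Huber penalty (a sketch; the function name is ours):

import numpy as np

def quantile_huber(r, kappa, tau):
    """Quantile Huber rho_{kappa,tau}, applied elementwise to r."""
    r = np.asarray(r, dtype=float)
    left  = tau * np.abs(r) - kappa * tau**2 / 2              # r < -tau*kappa
    mid   = r**2 / (2 * kappa)                                # quadratic middle piece
    right = (1 - tau) * np.abs(r) - kappa * (1 - tau)**2 / 2  # r > (1-tau)*kappa
    return np.where(r < -tau * kappa, left,
                    np.where(r > (1 - tau) * kappa, right, mid))

# tau = 0.5 gives a symmetric (standard) Huber penalty; tau != 0.5 tilts
# the linear tails to penalize positive and negative residuals asymmetrically.
print(quantile_huber(np.linspace(-3, 3, 7), kappa=1.0, tau=0.5))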
Huber

[Figure: graph of the Huber function.]
Sparse and Robust Formulation

  HBP_\sigma:  \min_x \|x\|_1  \text{ s.t. }  \rho(b - Ax) \le \sigma

Problem specification:
- x: 20-sparse spike train in \mathbb{R}^{512}
- b: measurements in \mathbb{R}^{120}
- A: measurement matrix satisfying RIP
- ρ: Huber function, with error level σ set at 0.01
- 5 outliers

Results: in the presence of outliers, the robust formulation recovers the spike train, while the standard (least-squares) formulation does not.

[Figures: signal recovery (truth, Huber, LS) and residuals.]
Sparse and Robust Formulation

The same formulation with a nonnegativity constraint:

  HBP_\sigma:  \min_{0 \le x} \|x\|_1  \text{ s.t. }  \rho(b - Ax) \le \sigma

Same specification as above, now with x a 20-sparse spike train in \mathbb{R}^{512}_{+}. Again, in the presence of outliers the robust formulation recovers the spike train, while the standard formulation does not.

[Figures: signal recovery (truth, Huber, LS) and residuals.]