Bayesian Models for Regularization in Optimization
1 Bayesian Models for Regularization in Optimization
Aleksandr Aravkin (UBC), Bradley Bell (UW), Alessandro Chiuso (Padova), Michael Friedlander (UBC), Gianluigi Pillonetto (Padova), Jim Burke (UW)
MOPTA, Lehigh University, August 17, 2011
2 Outline
- The Optimization Problem
- Applications
- PLQ Functions
- Log-Concave PLQ Densities
- Interior Point Methods for PLQ Optimization
- Example: Robust Kalman Smoothing
- PLQ Objectives with PLQ Constraints
3-12 The Optimization Problem

$$\min_{x \in X} \rho(F(x))$$

Example: $\rho$ an error function and $F(x) = Ax - y$.

- $\rho$ is typically convex (convex composite optimization).
- Examples of $\rho$ (up to rescaling): $\|\cdot\|_2^2$, $\|\cdot\|_1$, $\rho_H$ (Huber), $\rho_V$ (Vapnik).
- Or $\rho$ is a combination of these, as well as the convex indicators of level sets of such functions ($\rho(y) \le \tau$).
13 Graphs of $\rho$

Quadratic: $V(x) = \tfrac{1}{2}x^2$. Absolute value: $V(x) = |x|$.

Huber:
$$\rho_H(x) = \begin{cases} -Kx - \tfrac{1}{2}K^2, & x < -K \\ \tfrac{1}{2}x^2, & -K \le x \le K \\ Kx - \tfrac{1}{2}K^2, & K < x \end{cases}$$

Vapnik:
$$\rho_V(x) = \begin{cases} -x - \epsilon, & x < -\epsilon \\ 0, & -\epsilon \le x \le \epsilon \\ x - \epsilon, & \epsilon \le x \end{cases}$$
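For concreteness, here is a small sketch of these four penalties (an illustration of mine, not from the talk; the vectorized NumPy forms are assumptions):

```python
import numpy as np

def quadratic(x):
    return 0.5 * x**2

def absolute(x):
    return np.abs(x)

def huber(x, K=1.0):
    # quadratic near zero, linear in the tails
    return np.where(np.abs(x) <= K, 0.5 * x**2, K * np.abs(x) - 0.5 * K**2)

def vapnik(x, eps=0.5):
    # zero inside the eps-insensitive band, linear outside
    return np.maximum(np.abs(x) - eps, 0.0)

x = np.linspace(-3, 3, 7)
print(huber(x, K=1.0))
print(vapnik(x, eps=0.5))
```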
14-19 Applications

- Robust Kalman Filtering (RFPK-UW-NIH, APL-UW-NOAA): tracking drug concentrations, underwater vehicles
- Global Health: Burden of Disease Models (IHME-UW-GF)
- Robust Bundle Adjustment Algorithms (NASA-Ames)
- Sparsity Optimization
- Machine Learning (Reproducing Kernel Hilbert Spaces): control, sensor distribution networks
- Geophysical Inverse Problems (SLIM-UBC-NSERC)
20-21 Piecewise Linear Quadratic (PLQ) Penalties (Rockafellar-Wets 86)

$$\theta_{U,M}(w) := \sup_{u \in U} \left\{ \langle u, w \rangle - \tfrac{1}{2} \langle u, Mu \rangle \right\}, \qquad M \in S^k_+, \quad U \subseteq R^k \text{ polyhedral convex.}$$

Examples:
$$\tfrac{1}{2}\|w\|_2^2 = \sup_{u \in R^n} \left[ \langle u, w \rangle - \tfrac{1}{2}\langle u, u \rangle \right], \qquad \|w\|_1 = \sup_{|u_i| \le 1} \langle u, w \rangle.$$
22 Huber $\rho_H$ as a PLQ function

$$\rho_H(w) = \sup_{u \in [-K,K]} \left\{ \langle w, u \rangle - \tfrac{1}{2}\langle u, u \rangle \right\},$$

which recovers the piecewise form above: $-Kx - \tfrac{1}{2}K^2$ for $x < -K$, $\tfrac{1}{2}x^2$ for $-K \le x \le K$, and $Kx - \tfrac{1}{2}K^2$ for $K < x$.
23-24 Vapnik $\rho_V$ as a PLQ function

Modest extension:
$$\rho_{U,M,b,B}(y) := \theta_{U,M}(b + By) = \sup_{u \in U} \left\{ \langle u, b + By \rangle - \tfrac{1}{2}\langle u, Mu \rangle \right\}, \qquad B \in R^{s \times k} \text{ injective}, \quad b \in R^s.$$

For Vapnik ($\rho_V(x) = 0$ on $[-\epsilon, \epsilon]$ and $|x| - \epsilon$ outside):
$$\rho_V(y) = \sup_{u \in U} \langle b + By, u \rangle, \qquad U = [0,1]^k \times [0,1]^k, \quad B = \begin{bmatrix} I \\ -I \end{bmatrix}, \quad b = \begin{pmatrix} -\epsilon \mathbf{1} \\ -\epsilon \mathbf{1} \end{pmatrix}.$$
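As a quick sanity check (my own sketch, not from the slides), the sup defining $\theta_{U,M}$ can be approximated by brute force over a grid of dual variables and compared with the closed-form Huber penalty:

```python
import numpy as np

def huber(w, K=1.0):
    return np.where(np.abs(w) <= K, 0.5 * w**2, K * np.abs(w) - 0.5 * K**2)

def theta_box(w, K=1.0, n_grid=20001):
    """sup_{u in [-K, K]} { u*w - u^2/2 }, approximated on a grid."""
    u = np.linspace(-K, K, n_grid)
    vals = np.outer(np.atleast_1d(w), u) - 0.5 * u**2
    return vals.max(axis=1)

w = np.linspace(-3, 3, 13)
# agreement up to grid resolution
print(np.max(np.abs(theta_box(w, K=1.0) - huber(w, K=1.0))))
```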
25-32 Optimization Model Class

$$\rho(F(x))$$

- $\rho$ is the optimization model.
- $F$ is the data for the optimization model.
- How is the model $\rho$ chosen to reflect our knowledge about the problem data $F$ and the nature of the solution $x$?

Consider linear regression as a prototypical example:
$$\rho(F(x)) = \tfrac{1}{2}\|Ax - y\|_2^2 + \hat{\rho}(x),$$
where $\hat{\rho}$ plays the role of a Bayesian prior.

Maximum Likelihood Estimation: $\rho(F(x))$ is a negative log-likelihood of the joint density.
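To make the Bayesian reading concrete, here is a hedged sketch (my example, not the talk's): with Gaussian noise and a Gaussian prior on $x$, minimizing the negative log posterior is exactly ridge regression. The value of `lam` below is an assumed ratio of prior to noise precision.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
x_true = rng.standard_normal(5)
y = A @ x_true + 0.1 * rng.standard_normal(50)

lam = 0.5  # assumed prior precision relative to noise precision
# MAP estimate: argmin 0.5*||Ax - y||^2 + 0.5*lam*||x||^2
x_map = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ y)
print(x_map)
```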
33-34 Log-Concave PLQ Densities

Define probability densities
$$p(y) \propto \exp\left(-\rho_{U,M,b,B}(y)\right)$$
on $\mathrm{aff}(\mathrm{dom}(\rho)) = \left[B^T\left(U^\infty \cap \mathrm{Null}(M)\right)\right]^\perp$.

Here $U^\infty$ is the horizon cone (or recession cone) of the convex set $U$ (the set of directions in which $U$ is unbounded).

When are these true densities?
35-37 PLQ Densities

THEOREM (PLQ Integrability): Suppose $\rho(y)$ is coercive, and let $n_{\mathrm{aff}}$ denote the dimension of $\mathrm{aff}(\mathrm{dom}(\rho))$. Then the function $f(y) = \exp(-\rho(y))$ is integrable on $\mathrm{aff}(\mathrm{dom}(\rho))$ with respect to the $n_{\mathrm{aff}}$-dimensional Lebesgue measure.

THEOREM (Coercivity of $\rho$): $\rho$ is coercive if and only if $\left[B^T \mathrm{cone}(U)\right]^\circ = \{0\}$, or equivalently if $B^T \mathrm{cone}(U) = R^n$.

In particular, $\|\cdot\|_2^2$, $\|\cdot\|_1$, $\rho_H$, $\rho_V$ all generate true probability densities.
38 PLQ Densities

DEFINITION: Let $\rho$ be any coercive piecewise linear-quadratic function on $R^n$ of the form $\rho_{U,M,b,B}(y) = \theta_{U,M}(b + By)$. Define $p(y)$ to be the density
$$p(y) = \begin{cases} c_1^{-1} \exp\left(-c_2 \rho(y)\right), & y \in \mathrm{dom}(\rho) \\ 0, & \text{else,} \end{cases}$$
where $c_2$ is a positive constant and
$$c_1 = \int_{y \in \mathrm{dom}(\rho)} \exp\left(-c_2 \rho(y)\right) dy.$$
The integral above is with respect to the Lebesgue measure of dimension $\dim(\mathrm{aff}(\mathrm{dom}(\rho)))$.
39-42 Constructing PLQ Densities

Let $y = (y_1, \ldots, y_n)^T$ be a vector of independent PLQ random variables with mean 0 and variance 1. Each $y_i$ has parameters $b_i, B_i, U_i, M_i$. Set
$$U = U_1 \times U_2 \times \cdots \times U_n, \quad M = \mathrm{diag}(M_1, \ldots, M_n), \quad B = \mathrm{diag}(B_1, \ldots, B_n), \quad b = \mathrm{vec}[b_1, \ldots, b_n].$$

The random vector $z = A^{1/2} y + \mu$ then has mean $\mu$ and variance $A$. If $C$ is the normalizing constant for $y$, then $C \det(A)^{1/2}$ is the normalizing constant for $z$.
43-44 PLQ Normalizing Constants

Suppose $\rho(y)$ is a scalar PLQ penalty symmetric about 0. Then
$$p(y) = \frac{1}{c_1} \exp\left(-\rho(c_2 y)\right)$$
is a PLQ density (with mean 0 and variance 1) when
$$c_2 = \sqrt{\frac{\int u^2 \exp(-\rho(u))\, du}{\int \exp(-\rho(u))\, du}}, \qquad c_1 = \frac{1}{c_2} \int \exp(-\rho(u))\, du.$$

We need to compute $\int u^2 \exp(-\rho(u))\, du$ and $\int \exp(-\rho(u))\, du$.
45 Huber Normalizing Constants

$$\int \exp\left(-\rho_H(y)\right) dy = \frac{2\exp(-K^2/2)}{K} + \sqrt{2\pi}\,\left(2\Phi(K) - 1\right)$$
$$\int y^2 \exp\left(-\rho_H(y)\right) dy = \frac{4\exp(-K^2/2)\,(1 + K^2)}{K^3} + \sqrt{2\pi}\,\left(2\Phi(K) - 1\right),$$
where $\Phi$ is the standard normal CDF.
46 Vapnik Normalizing Constants

$$\int \exp\left(-\rho_V(y)\right) dy = 2(\epsilon + 1)$$
$$\int y^2 \exp\left(-\rho_V(y)\right) dy = \tfrac{2}{3}\epsilon^3 + 2\left(2 + 2\epsilon + \epsilon^2\right)$$
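These constants, and the resulting $c_1, c_2$, are easy to check numerically. A sketch of mine using standard SciPy routines (`scipy.integrate.quad`, `scipy.stats.norm`):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

K, eps = 1.0, 0.5
huber = lambda y: 0.5 * y**2 if abs(y) <= K else K * abs(y) - 0.5 * K**2
vapnik = lambda y: max(abs(y) - eps, 0.0)

def moments(rho):
    """Zeroth and second moments of exp(-rho) on the real line."""
    m0, _ = quad(lambda y: np.exp(-rho(y)), -np.inf, np.inf)
    m2, _ = quad(lambda y: y**2 * np.exp(-rho(y)), -np.inf, np.inf)
    return m0, m2

m0, m2 = moments(huber)
print(m0, 2 * np.exp(-K**2 / 2) / K + np.sqrt(2 * np.pi) * (2 * norm.cdf(K) - 1))
print(m2, 4 * np.exp(-K**2 / 2) * (1 + K**2) / K**3
          + np.sqrt(2 * np.pi) * (2 * norm.cdf(K) - 1))

m0, m2 = moments(vapnik)
print(m0, 2 * (eps + 1))
print(m2, 2 / 3 * eps**3 + 2 * (2 + 2 * eps + eps**2))

# normalizing constants for the standardized density p(y) = exp(-rho(c2*y))/c1
c2 = np.sqrt(m2 / m0)
c1 = m0 / c2
```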
47-49 PLQ Optimization

$$\min_y\ \rho_{U,M,b,B}(y) := \sup_{u \in U} \left\{ \langle u, b + By \rangle - \tfrac{1}{2} u^T M u \right\}, \qquad U = \{u : C^T u \le c\}.$$

For example, $\rho_{U,M,b,B}(y) = \theta_{U_1,M_1}(Ay - r) + \theta_{U_2,M_2}(y)$ (a misfit plus a regularizer) fits this form.

KKT conditions:
$$\begin{aligned} 0 &= B^T u \\ 0 &= b + By - Mu - Cq \\ 0 &= C^T u + s - c \\ 0 &= q_i s_i \ \ \forall i, \qquad q, s \ge 0. \end{aligned}$$
50-51 Interior Point Methods (IPM)

Relax the complementarity condition:
$$\begin{aligned} 0 &= B^T u \\ 0 &= b + By - Mu - Cq \\ 0 &= C^T u + s - c \\ \tau &= q_i s_i \ \ \forall i, \qquad q, s \ge 0. \end{aligned}$$

THEOREM: This KKT system can be solved using an IPM if and only if $\mathrm{Null}(M) \cap \mathrm{Null}(C^T) = \{0\}$. In particular, this is implied by the condition $\mathrm{dom}(\theta_{U,M}) = R^m$.
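Below is a minimal sketch of a primal-dual interior point iteration for this relaxed KKT system. It is my own toy implementation, not the authors' solver: it assembles the full Newton system densely, uses fraction-to-boundary damping, and shrinks $\tau$ geometrically. The Huber-regression usage at the end is an assumed example.

```python
import numpy as np

def plq_ipm(B, b, M, C, c, iters=60):
    """Toy primal-dual IPM for min_y sup_{C^T u <= c} <u, b + B y> - 0.5 u^T M u."""
    k, n = B.shape          # u lives in R^k, y in R^n
    m = C.shape[1]          # number of inequalities defining U
    y, u = np.zeros(n), np.zeros(k)
    q, s = np.ones(m), np.ones(m)
    for _ in range(iters):
        tau = 0.1 * (q @ s) / m          # shrink the complementarity target
        r = np.concatenate([B.T @ u,
                            b + B @ y - M @ u - C @ q,
                            C.T @ u + s - c,
                            q * s - tau])
        Z = np.zeros
        J = np.block([  # Jacobian in variable order (y, u, q, s)
            [Z((n, n)), B.T,       Z((n, m)),  Z((n, m))],
            [B,         -M,        -C,         Z((k, m))],
            [Z((m, n)), C.T,       Z((m, m)),  np.eye(m)],
            [Z((m, n)), Z((m, k)), np.diag(s), np.diag(q)],
        ])
        dy, du, dq, ds = np.split(np.linalg.solve(J, -r), [n, n + k, n + k + m])
        alpha = 1.0                      # fraction-to-boundary: keep q, s > 0
        for v, dv in ((q, dq), (s, ds)):
            if (dv < 0).any():
                alpha = min(alpha, 0.99 * np.min(-v[dv < 0] / dv[dv < 0]))
        y += alpha * dy; u += alpha * du; q += alpha * dq; s += alpha * ds
    return y

# Hypothetical usage: Huber regression min_x sum_i rho_H((A x - y_obs)_i),
# i.e. B = A, b = -y_obs, M = I, and U = {u : |u_i| <= K}.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 2))
y_obs = A @ np.array([1.0, -2.0]) + 0.1 * rng.standard_normal(30)
y_obs[::10] += 5.0                               # inject outliers
K, meas = 1.0, 30
C = np.hstack([np.eye(meas), -np.eye(meas)])     # C^T u <= c encodes |u_i| <= K
print(plq_ipm(B=A, b=-y_obs, M=np.eye(meas), C=C, c=K * np.ones(2 * meas)))
```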
52 Example: Robust Kalman Smoothing

$$x_k = g_k(x_{k-1}) + w_k, \qquad z_k = h_k(x_k) + v_k,$$

where
- $g_k : R^n \to R^n$ is a known process function,
- $h_k : R^n \to R^{m(k)}$ is a known measurement function,
- $w_k$ is unknown Gaussian process noise, $N(0, Q_k)$,
- $v_k$ is unknown $\ell_1$-Laplace measurement noise, $L_1(0, R_k)$.
53-54 Robust Kalman Smoothing

An unknown linear deterministic process:
$$X(0) = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \qquad \dot{X}(t) = \begin{pmatrix} -X_2(t) \\ X_1(t) \end{pmatrix}, \quad \text{i.e. } X(t) = \begin{pmatrix} \cos(t) \\ \sin(t) \end{pmatrix}.$$

For $k = 0, \ldots, N$, let $t_k = k\,\Delta t$ and $x_k = X(t_k)$ ($N = 100$), with discretized model
$$g_k(x_{k-1}) = \begin{bmatrix} 1 & 0 \\ \Delta t & 1 \end{bmatrix} x_{k-1} + w_k, \qquad w_k \sim N(0, Q_k), \qquad Q_k = \begin{bmatrix} \Delta t & \Delta t^2/2 \\ \Delta t^2/2 & \Delta t^3/3 \end{bmatrix},$$
$$z_k = X_2(t_k) + v_k.$$

We assume the observations $z_k$ contain outliers.
55 Robust Kalman Smoothing

$$v_k \sim (1 - p)\,N(0, 0.25) + p\,N(0, \phi)$$

[Figure: function units vs. time. Simulation: measurements (+), outliers (o) (absolute residuals more than three standard deviations), true function (thick line), $\ell_1$-Laplace estimate (thin line), Gaussian estimate (dashed line), Gaussian outlier-removal estimate (dotted line).]
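A hedged sketch of the data generation in this experiment (my own reconstruction; the step size $\Delta t$ and the values of $p$ and $\phi$ are assumptions, and the smoothers themselves are not shown):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
dt = 2 * np.pi / N                 # assumed step size (one period; not stated on the slide)
p, phi = 0.1, 100.0                # assumed contamination rate and outlier variance

t = dt * np.arange(1, N + 1)
truth = np.sin(t)                  # z_k measures X_2(t_k) = sin(t_k)

# measurement noise: Gaussian mixture with occasional large outliers
is_outlier = rng.random(N) < p
v = np.where(is_outlier,
             rng.normal(0.0, np.sqrt(phi), N),
             rng.normal(0.0, 0.5, N))          # sqrt(0.25) = 0.5
z = truth + v

# the smoother's (mis)specified process model from the slide
G = np.array([[1.0, 0.0], [dt, 1.0]])
Q = np.array([[dt, dt**2 / 2], [dt**2 / 2, dt**3 / 3]])
H = np.array([[0.0, 1.0]])         # observe the second state component
```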
56 Robust Kalman Smoothing

Mean squared error:
$$\mathrm{MSE} = \frac{1}{N} \sum_{k=1}^{N} \left( [x_{1,k} - \hat{x}_{1,k}]^2 + [x_{2,k} - \hat{x}_{2,k}]^2 \right)$$

Table: median MSE and 95% confidence intervals for the different estimation methods, over 1000 realizations of each $v_k \sim (1-p)N(0, 0.25) + pN(0, \phi)$.

  p    phi   GKF              RKF             IGS             ILS
  -    -     .34 (.24, .47)   .42 (.15, 1.1)  .04 (.02, .1)   .04 (.01, .1)
  -    -     -   (.26, .60)   .48 (.15, 1.1)  .06 (.02, .12)  .04 (.02, .10)
  -    -     -   (.32, 1.1)   .56 (.18, 1.5)  .09 (.04, .29)  .05 (.02, .12)
  -    -     -   (.42, 2.3)   .58 (.19, 1.7)  .17 (.05, .55)  .05 (.02, .13)
  -    -     -   (1.7, 17.9)  .55 (.18, 2.0)  1.3 (.30, 5.0)  .05 (.02, .14)
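The MSE on the slide, as a small helper function (mine, for completeness):

```python
import numpy as np

def mse(x, x_hat):
    """Mean squared error over both state components, as on the slide.
    x, x_hat: arrays of shape (N, 2)."""
    return np.mean(np.sum((x - x_hat) ** 2, axis=1))
```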
57-59 PLQ Objectives with PLQ Constraints

$$P(\psi, \phi, \tau): \quad \underset{x \in X}{\text{minimize}}\ \psi(x) \quad \text{subject to}\ \phi(x) \le \tau.$$

$$v_1(\tau) = \min\,\{\psi(x) \mid x \in X,\ \phi(x) \le \tau\}, \qquad v_2(\beta) = \min\,\{\phi(x) \mid x \in X,\ \psi(x) \le \beta\}.$$

$v_1$ and $v_2$ are both convex functions.
60-63 An Inverse Function Theorem for Optimal-Value Functions

Suppose that there is an interval $(\tau_l, \tau_u) \subseteq R \cup \{\pm\infty\}$ with $(\tau_l, \tau_u) \cap R \neq \emptyset$ such that
$$\tau \in (\tau_l, \tau_u) \implies \mathrm{argmin}\, P(\psi, \phi, \tau) \subseteq \{x \in X \mid \phi(x) = \tau\}.$$
Then, for every $\tau \in (\tau_l, \tau_u)$:

(a) $v_2(v_1(\tau)) = \tau$, and

(b) $\mathrm{argmin}\, P(\psi, \phi, \tau) \subseteq \mathrm{argmin}\, P(\phi, \psi, v_1(\tau)) \subseteq \{x \in X \mid \psi(x) = v_1(\tau)\}$.

Moreover, $v_1(v_2(\beta)) = \beta$ for all $\beta \in (\beta_l, \beta_u)$, where
$$\beta_l = \inf\,\{v_1(\tau) \mid \tau \in (\tau_l, \tau_u)\} \quad \text{and} \quad \beta_u = \sup\,\{v_1(\tau) \mid \tau \in (\tau_l, \tau_u)\},$$
whenever $(\beta_l, \beta_u) \subseteq \{v_1(\tau) \mid \tau \in (\tau_l, \tau_u)\}$.
64-65 Optimization by Zero Finding

The inverse function theorem gives conditions under which $v_1(v_2(\beta)) = \beta$. Therefore, if we find a solution $\bar{\tau}$ of $v_1(\tau) = \beta$, then $\bar{\tau} = v_2(\beta)$ and
$$\mathrm{argmin}\, P(\psi, \phi, \bar{\tau}) \subseteq \mathrm{argmin}\, P(\phi, \psi, v_1(\bar{\tau})) = \mathrm{argmin}\, P(\phi, \psi, \beta).$$

The equation $v_1(\tau) = \beta$ can be solved via an inexact secant method, with the iterates converging at a superlinear rate.
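A minimal sketch of the zero-finding loop (an exact-evaluation secant method on a toy value function of my choosing; the talk's method allows inexact evaluations of $v_1$, which this sketch does not model):

```python
def secant_root(v1, beta, tau0, tau1, tol=1e-10, iters=50):
    """Find tau with v1(tau) = beta by the secant method."""
    f0, f1 = v1(tau0) - beta, v1(tau1) - beta
    for _ in range(iters):
        if abs(f1) < tol or f1 == f0:
            break
        tau0, tau1 = tau1, tau1 - f1 * (tau1 - tau0) / (f1 - f0)
        f0, f1 = f1, v1(tau1) - beta
    return tau1

# toy value function: v1(tau) = min_x { 0.5*(x - 3)^2 : |x| <= tau },
# which equals 0.5*(3 - tau)^2 for 0 <= tau <= 3
v1 = lambda tau: 0.5 * max(0.0, 3.0 - tau) ** 2
print(secant_root(v1, beta=0.5, tau0=0.0, tau1=1.0))   # -> 2.0
```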