Matrix Support Functional and its Applications
Slide 1: Matrix Support Functional and its Applications

James V. Burke, Mathematics, University of Washington. Joint work with Yuan Gao (UW) and Tim Hoheisel (McGill). CORS, Banff, June 1, 2016.
Slides 2-3: Connections

What do the following topics have in common?

- Quadratic optimization problems with equality constraints
- The matrix fractional function and its generalizations
- Ky Fan (p,k) norms
- K-means clustering
- Best affine unbiased estimation
- Supervised representation learning
- Multi-task learning
- Variational Gram functions

Answer: they can all be represented using a matrix support functional that is smooth on the interior of its domain.
Slides 4-6: A Matrix Support Functional (Burke-Hoheisel (2015))

Given $A \in \mathbb{R}^{p\times n}$ and $B \in \mathbb{R}^{p\times m}$, set
$$\mathcal{D}(A,B) := \Big\{ \big(Y, -\tfrac{1}{2}YY^T\big) \in \mathbb{R}^{n\times m}\times\mathcal{S}^n \;\Big|\; Y \in \mathbb{R}^{n\times m}:\ AY = B \Big\}.$$
We consider the support functional of $\mathcal{D}(A,B)$:
$$\sigma\big((X,V)\mid\mathcal{D}(A,B)\big) = \sup_{AY=B}\ \Big\langle (X,V),\ \big(Y,-\tfrac12 YY^T\big)\Big\rangle = -\inf_{AY=B}\ \Big\{\tfrac12\operatorname{tr}\big(Y^T V Y\big) - \langle X, Y\rangle\Big\}.$$
Slides 7-10: Support Functions

$$\sigma_S(x) := \sigma(x\mid S) := \sup_{y\in S}\,\langle x, y\rangle$$
$$\overline{\operatorname{conv}}\,S = \bigcap_{x}\,\{\, y \mid \langle x, y\rangle \le \sigma(x\mid S)\,\}$$
$$\sigma_S = \sigma_{\operatorname{cl} S} = \sigma_{\operatorname{conv} S} = \sigma_{\overline{\operatorname{conv}}\, S}$$
When $S$ is a closed convex set,
$$\partial\sigma_S(x) = \operatorname*{arg\,max}_{y\in S}\,\langle x, y\rangle.$$
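As a quick illustration of these identities (added here, not on the slides): for a finite point set $S$, the identity $\sigma_S = \sigma_{\operatorname{conv} S}$ means the support function of the polytope $\operatorname{conv} S$ can be evaluated by a maximum over the points themselves. A minimal numpy sketch, with a hypothetical helper `support`:

```python
import numpy as np

def support(S, x):
    """Support function sigma_S(x) = max_{y in S} <x, y> for a finite set S.

    Since sigma_S = sigma_{conv S}, a maximum over the finitely many
    points of S also gives the support function of their convex hull.
    """
    return max(np.dot(x, y) for y in S)

# Three points in the plane; their convex hull is a triangle.
S = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
x = np.array([2.0, 1.0])
print(support(S, x))  # 2.0, attained at the vertex (1, 0)
```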
Slide 11: Epigraph

$$\operatorname{epi} f := \{(x,\mu) \mid f(x) \le \mu\}, \qquad f^*(y) := \sigma\big((y,-1)\mid \operatorname{epi} f\big).$$
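A one-line sanity check of this identity (my addition, not on the slides): for $f(x) = \tfrac12 x^2$,
$$f^*(y) = \sigma\big((y,-1)\mid\operatorname{epi} f\big) = \sup_{\mu \ge x^2/2}\,(yx - \mu) = \sup_x\Big(yx - \tfrac12 x^2\Big) = \tfrac12 y^2,$$
recovering the familiar self-conjugacy of the squared norm.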
Slides 12-14: A Representation for $\sigma((X,V)\mid\mathcal{D}(A,B))$

Let $A\in\mathbb{R}^{p\times n}$ and $B\in\mathbb{R}^{p\times m}$ be such that $\operatorname{rge} B \subset \operatorname{rge} A$. Then
$$\sigma\big((X,V)\mid\mathcal{D}(A,B)\big) = \begin{cases}\ \tfrac12\operatorname{tr}\!\left(\begin{pmatrix}X\\ B\end{pmatrix}^{T} M(V)^{\dagger}\begin{pmatrix}X\\ B\end{pmatrix}\right) & \text{if } \operatorname{rge}\begin{pmatrix}X\\ B\end{pmatrix} \subset \operatorname{rge} M(V)\ \text{and}\ V \succeq_{\ker A} 0,\\[4pt] \ +\infty & \text{else,}\end{cases}$$
where
$$M(V) := \begin{pmatrix} V & A^T \\ A & 0\end{pmatrix}$$
and $V \succeq_{\ker A} 0$ means $u^T V u \ge 0$ for all $u\in\ker A$.

In particular,
$$\operatorname{dom}\sigma(\,\cdot\mid\mathcal{D}(A,B)) = \operatorname{dom}\partial\sigma(\,\cdot\mid\mathcal{D}(A,B)) = \Big\{(X,V)\in\mathbb{R}^{n\times m}\times\mathcal{S}^n \ \Big|\ \operatorname{rge}\begin{pmatrix}X\\ B\end{pmatrix}\subset\operatorname{rge} M(V),\ V\succeq_{\ker A} 0\Big\},$$
with
$$\operatorname{int}\big(\operatorname{dom}\sigma(\,\cdot\mid\mathcal{D}(A,B))\big) = \{(X,V)\in\mathbb{R}^{n\times m}\times\mathcal{S}^n \mid V \succ_{\ker A} 0\}.$$
The inverse $M(V)^{-1}$ exists when $V\succ_{\ker A} 0$ and $A$ is surjective.
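A minimal numerical sketch of this representation (my own, not from the slides; $V$ is chosen positive definite so that $V \succ_{\ker A} 0$ and the range condition hold automatically, making $M(V)$ invertible):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 4, 2, 2
A = rng.standard_normal((p, n))            # surjective with probability 1
B = A @ rng.standard_normal((n, m))        # guarantees rge B within rge A
X = rng.standard_normal((n, m))
C = rng.standard_normal((n, n))
V = C @ C.T + np.eye(n)                    # V positive definite

# Block KKT matrix M(V) = [[V, A^T], [A, 0]].
M = np.block([[V, A.T], [A, np.zeros((p, p))]])
RHS = np.vstack([X, B])

# Closed-form value: (1/2) tr([X; B]^T M(V)^{-1} [X; B]).
val_formula = 0.5 * np.trace(RHS.T @ np.linalg.solve(M, RHS))

# Direct value: sigma = <X, Y*> - (1/2) tr(Y*^T V Y*), where (Y*, Lam)
# solves the KKT system M(V) [Y; Lam] = [X; B].
Z = np.linalg.solve(M, RHS)
Y = Z[:n]
val_direct = np.sum(X * Y) - 0.5 * np.trace(Y.T @ V @ Y)

print(np.isclose(val_formula, val_direct))  # True
```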
Slides 15-19: Relationship to Equality-Constrained QPs

Consider an equality-constrained QP:
$$\nu(x,V) := \inf_{u\in\mathbb{R}^n}\Big\{\tfrac12 u^T V u - x^T u \ \Big|\ Au = b\Big\}.$$
The Lagrangian is $L(u,\lambda) = \tfrac12 u^T V u - x^T u + \lambda^T(Au - b)$, so the optimality conditions are
$$Vu + A^T\lambda - x = 0, \qquad Au = b.$$
Equivalently,
$$M(V)\begin{pmatrix}u\\ \lambda\end{pmatrix} = \begin{pmatrix}V & A^T\\ A & 0\end{pmatrix}\begin{pmatrix}u\\ \lambda\end{pmatrix} = \begin{pmatrix}x\\ b\end{pmatrix}.$$
Hence
$$\nu(x,V) = -\sigma\big((x,V)\mid\mathcal{D}(A,b)\big).$$
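A short sketch of this reduction (my own example data): solve the QP by the block KKT system and verify feasibility and stationarity of the reduced gradient on the feasible manifold $u = u_0 + Nz$, where the columns of $N$ span $\ker A$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 2
A = rng.standard_normal((p, n))
b = rng.standard_normal(p)
x = rng.standard_normal(n)
C = rng.standard_normal((n, n))
V = C @ C.T + np.eye(n)   # positive definite, so the QP has a unique solution

# Solve the KKT system M(V) [u; lam] = [x; b].
M = np.block([[V, A.T], [A, np.zeros((p, p))]])
sol = np.linalg.solve(M, np.concatenate([x, b]))
u = sol[:n]

# Check: u is feasible, and the gradient of the objective is orthogonal
# to ker A (the reduced gradient vanishes).
assert np.allclose(A @ u, b)
_, _, Vt = np.linalg.svd(A)
N = Vt[p:].T                              # orthonormal basis of ker A
grad = V @ u - x                          # gradient of the objective at u
assert np.allclose(N.T @ grad, 0, atol=1e-10)

nu = 0.5 * u @ V @ u - x @ u
print("nu(x, V) =", nu)
```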
Slides 20-21: Maximum Likelihood Estimation

$$L(\mu,\Sigma;Y) := (2\pi)^{-mN/2}\,(\det\Sigma)^{-N/2}\prod_{i=1}^{N}\exp\big(-\tfrac12(y_i-\mu)^T\Sigma^{-1}(y_i-\mu)\big)$$
Up to a constant, the negative log-likelihood is
$$-\ln L(\mu,\Sigma;Y) = \tfrac12\ln\det\Sigma + \tfrac12\operatorname{tr}\big((Y-M)^T\Sigma^{-1}(Y-M)\big) = \sigma\big(((Y-M),\Sigma)\mid\mathcal{D}(0,0)\big) - \tfrac12\big(-\ln\det\Sigma\big).$$

The matrix fractional function: take $A = 0$ and $B = 0$, and set
$$\gamma(X,V) := \sigma\big((X,V)\mid\mathcal{D}(0,0)\big) = \begin{cases}\ \tfrac12\operatorname{tr}\big(X^T V^{\dagger} X\big) & \text{if } \operatorname{rge} X \subset \operatorname{rge} V,\ V\succeq 0,\\ \ +\infty & \text{else.}\end{cases}$$
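A sketch implementation of $\gamma$ (my own; `gamma` and its tolerances are illustrative, not from the talk), using the Moore-Penrose pseudoinverse and numerical range checks:

```python
import numpy as np

def gamma(X, V, tol=1e-10):
    """Matrix fractional function gamma(X, V) = (1/2) tr(X^T V^+ X).

    Returns +inf unless V is symmetric positive semidefinite and
    rge X lies within rge V (both checked numerically). V^+ is the
    Moore-Penrose pseudoinverse.
    """
    if not np.allclose(V, V.T, atol=tol) or np.linalg.eigvalsh(V).min() < -tol:
        return np.inf                                  # V not symmetric PSD
    Vp = np.linalg.pinv(V)
    if not np.allclose(V @ (Vp @ X), X, atol=1e-8):    # rge X within rge V?
        return np.inf
    return 0.5 * np.trace(X.T @ Vp @ X)

X = np.array([[1.0], [2.0]])
V = np.diag([2.0, 1.0])
print(gamma(X, V))  # 0.5 * (1/2 + 4) = 2.25
```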
Slides 22-25: $\operatorname{conv}\mathcal{D}(A,B)$

Recall
$$\partial\sigma_C(x) = \{\, y \in \overline{\operatorname{conv}}\,C \mid \langle x, y\rangle = \sigma_C(x)\,\}.$$
So for $\partial\sigma((X,V)\mid\mathcal{D}(A,B))$ we need $\overline{\operatorname{conv}}(\mathcal{D}(A,B))$. Set
$$\mathcal{S}^n_+(\ker A) := \{\, W\in\mathcal{S}^n \mid u^T W u \ge 0\ \ \forall\, u\in\ker A\,\} = \{\, W \mid W \succeq_{\ker A} 0\,\}.$$
Then $\mathcal{S}^n_+(\ker A)$ is a closed convex cone whose polar is given by
$$\mathcal{S}^n_+(\ker A)^{\circ} = \{\, W\in\mathcal{S}^n \mid W = PWP \preceq 0\,\},$$
where $P$ is the orthogonal projection onto $\ker A$. For
$$\mathcal{D}(A,B) := \big\{\big(Y,-\tfrac12 YY^T\big)\in\mathbb{R}^{n\times m}\times\mathcal{S}^n \mid Y\in\mathbb{R}^{n\times m}:\ AY=B\big\},$$
we obtain
$$\overline{\operatorname{conv}}\,\mathcal{D}(A,B) = \Omega(A,B) := \big\{(Y,W)\in\mathbb{R}^{n\times m}\times\mathcal{S}^n \mid AY = B \ \text{and}\ \tfrac12 YY^T + W \in \mathcal{S}^n_+(\ker A)^{\circ}\big\}.$$
Slide 26: Applications

- Quadratic optimization problems with equality constraints
- The matrix fractional function and its generalizations
- Ky Fan (p,k) norms
- K-means clustering
- Best affine unbiased estimation
- Supervised representation learning
- Multi-task learning
- Variational Gram functions
Slides 27-28: Motivating Examples

Recall the matrix fractional function
$$\gamma(X,V) := \sigma\big((X,V)\mid\mathcal{D}(0,0)\big) = \begin{cases}\ \tfrac12\operatorname{tr}\big(X^T V^{\dagger} X\big) & \text{if } \operatorname{rge} X\subset\operatorname{rge} V,\ V\succeq 0,\\ \ +\infty & \text{else.}\end{cases}$$
We have the following two representations of the nuclear norm:
$$\|X\|_* = \min_V\ \gamma(X,V) + \tfrac12\operatorname{tr} V, \qquad \tfrac12\|X\|_*^2 = \min_V\ \gamma(X,V) + \delta\big(V\mid\operatorname{tr}(V)\le 1\big).$$
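A quick numerical check of the first representation (my own sketch): for $X$ of full column rank the minimizer is $V = (XX^T)^{1/2}$, at which $\gamma(X,V) + \tfrac12\operatorname{tr}V$ equals the sum of the singular values.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 3))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
nuclear = s.sum()

# Candidate minimizer V = (X X^T)^{1/2}, built from the SVD of X.
V = (U * s) @ U.T
gamma_val = 0.5 * np.trace(X.T @ np.linalg.pinv(V) @ X)
print(np.isclose(gamma_val + 0.5 * np.trace(V), nuclear))  # True
```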
Slides 29-30: Infimal Projections

For a closed proper convex function $h$, define the infimal projection
$$\varphi(X) := \inf_V\ \sigma\big((X,V)\mid\mathcal{D}(A,B)\big) + h(V).$$

Theorem. If $\operatorname{dom} h \cap \mathcal{S}^n_{++}(\ker A) \ne \emptyset$, then
$$\varphi^*(Y) = \inf\big\{\, h^*(-W) \mid (Y,W)\in\overline{\operatorname{conv}}\,\mathcal{D}(A,B)\,\big\}.$$
Slides 31-32: Infimal Projections with Indicators

When $h$ is the indicator of a closed convex set $\mathcal{V}\subset\mathcal{S}^n$, this becomes
$$\varphi_{\mathcal{V}}(X) := \inf_{V\in\mathcal{V}}\ \sigma\big((X,V)\mid\mathcal{D}(A,B)\big),$$
$$\varphi_{\mathcal{V}}^*(Y) = \tfrac12\,\sigma\big(YY^T \mid \{V\in\mathcal{V}\mid V\succeq_{\ker A} 0\}\big) + \delta\big(Y\mid AY=B\big) = \tfrac12\,\sigma\big(YY^T\mid \mathcal{V}\cap\mathcal{S}^n_+(\ker A)\big) + \delta\big(Y\mid AY=B\big).$$
Note that when $B = 0$, both $\varphi_{\mathcal{V}}$ and $\varphi_{\mathcal{V}}^*$ are positively homogeneous of degree 2. When $A = 0$ and $B = 0$, $\varphi_{\mathcal{V}}^*$ is called a variational Gram function in Jalali-Xiao-Fazel (2016?).
Slides 33-34: The Ky Fan (p,k) Norm

For $p\ge 1$ and $1\le k\le\min\{m,n\}$, the Ky Fan $(p,k)$-norm of a matrix $X\in\mathbb{R}^{n\times m}$ is given by
$$\|X\|_{p,k} = \Big(\sum_{i=1}^{k}\sigma_i^p\Big)^{1/p},$$
where $\sigma_1 \ge \sigma_2 \ge \cdots$ are the singular values of $X$ sorted in nonincreasing order.

The Ky Fan $(p,\min\{m,n\})$-norm is the Schatten-$p$ norm; the Ky Fan $(1,k)$-norm is the standard Ky Fan $k$-norm.
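A direct implementation from the definition (my own sketch; the helper name `ky_fan` is illustrative):

```python
import numpy as np

def ky_fan(X, p, k):
    """Ky Fan (p,k)-norm: the l_p norm of the k largest singular values."""
    s = np.linalg.svd(X, compute_uv=False)   # returned in decreasing order
    return np.sum(s[:k] ** p) ** (1.0 / p)

X = np.diag([3.0, 2.0, 1.0])
print(ky_fan(X, 1, 3))   # 6.0 -- Schatten-1 (nuclear) norm
print(ky_fan(X, 2, 3))   # sqrt(14) -- Frobenius (Schatten-2) norm
print(ky_fan(X, 1, 2))   # 5.0 -- standard Ky Fan 2-norm
```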
Slides 35-37: The Ky Fan (p,k) Norm (continued)

Corollary.
$$\tfrac12\|X\|^2_{\frac{2p}{p+1},\,\min\{m,n\}} = \inf_{\|V\|_{p,\min\{m,n\}}\le 1}\ \gamma(X,V).$$
Proof. Apply the infimal-projection result with $A = 0$, $B = 0$, $\gamma(X,V) = \sigma((X,V)\mid\mathcal{D}(0,0))$, and $\mathcal{V} = \{V \mid \|V\|_{p,\min\{m,n\}}\le 1\}$:
$$\Big(\inf_{\|V\|_{p,\min\{m,n\}}\le 1}\gamma(X,V)\Big)^{*} = \sigma\Big(\tfrac12 XX^T\ \Big|\ \{V\succeq 0 \mid \|V\|_{p,\min\{m,n\}}\le 1\}\Big) = \tfrac12\big\|XX^T\big\|_{\frac{p}{p-1},\,\min\{m,n\}} = \tfrac12\|X\|^2_{\frac{2p}{p-1},\,\min\{m,n\}}.$$
Conjugating once more gives the result, since the Schatten $\tfrac{2p}{p-1}$- and $\tfrac{2p}{p+1}$-norms are dual to one another.

As a special case, when $p = 1$:

Corollary.
$$\tfrac12\|X\|_*^2 = \min_{\operatorname{tr} V\le 1}\ \gamma(X,V).$$

Lemma. Let $\mathcal{V}$ be the set of rank-$k$ orthogonal projection matrices, $\mathcal{V} = \{UU^T \mid U\in\mathbb{R}^{n\times k},\ U^TU = I_k\}$. Then
$$\tfrac12\|X\|^2_{2,k} = \sigma\big(\tfrac12 XX^T\mid\mathcal{V}\big).$$
Proof. A consequence of the following fact [Fillmore-Williams 1971]:
$$\operatorname{conv}\{UU^T \mid U\in\mathbb{R}^{n\times k},\ U^TU = I_k\} = \{V\in\mathcal{S}^n \mid I \succeq V \succeq 0,\ \operatorname{tr} V = k\}.$$
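A small numerical check of the lemma (my own sketch): over rank-$k$ orthogonal projections, $\sup_U \tfrac12\operatorname{tr}(U^T XX^T U)$ is half the sum of the $k$ largest eigenvalues of $XX^T$, i.e. $\tfrac12\|X\|_{2,k}^2$.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 4))
k = 2

# Left: sup over rank-k projections, attained at the top-k eigenvectors of XX^T.
w, U = np.linalg.eigh(X @ X.T)             # eigenvalues in ascending order
lhs = 0.5 * w[-k:].sum()

# Right: half the squared Ky Fan (2,k)-norm of X.
s = np.linalg.svd(X, compute_uv=False)
rhs = 0.5 * np.sum(s[:k] ** 2)

print(np.isclose(lhs, rhs))  # True
```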
Slide 38: K-means Clustering [Zha-He-Ding-Gu-Simon 2001]

For data $X\in\mathbb{R}^{n\times m}$, the k-means objective is
$$K(X) := \min_{C,E}\ \tfrac12\|X - EC\|_2^2,$$
where $C\in\mathbb{R}^{k\times m}$ holds the $k$ centers, and $E$ is an $n\times k$ matrix each of whose rows is one of $e_1^T,\dots,e_k^T$, encoding the $k$ cluster assignments. The optimal $C$ is given by $C = (E^TE)^{-1}E^TX$. Define $P_E := E(E^TE)^{-1}E^T$; then $P_E$ is an orthogonal projection, and
$$K(X) = \min_E\ \tfrac12\|(I-P_E)X\|_2^2 = \tfrac12\min_E\ \operatorname{tr}\big((I-P_E)XX^T\big) = \tfrac12\|X\|_2^2 - \sigma\big(\tfrac12 XX^T \mid \mathcal{P}_k\big) \ \ge\ \tfrac12\big(\|X\|_2^2 - \|X\|_{2,k}^2\big),$$
where $\mathcal{P}_k$ is the set of projections $P_E$ arising from assignment matrices; the final inequality follows by relaxing $\mathcal{P}_k$ to all rank-$k$ orthogonal projections and applying the preceding lemma.
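A sketch comparing the spectral lower bound with an actual k-means objective (my own toy data and a few Lloyd iterations, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, k = 60, 2, 3
X = rng.standard_normal((n, m)) + rng.choice([-4.0, 0.0, 4.0], size=(n, 1))

# Spectral lower bound (1/2)(||X||_2^2 - ||X||_{2,k}^2) from the relaxation.
s = np.linalg.svd(X, compute_uv=False)
bound = 0.5 * (np.sum(s**2) - np.sum(s[:k] ** 2))

# K-means objective after a few Lloyd iterations from a random start
# (empty clusters are reseeded at a random data point).
labels = rng.integers(k, size=n)
for _ in range(20):
    C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                  else X[rng.integers(n)] for j in range(k)])
    labels = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
obj = 0.5 * np.sum((X - C[labels]) ** 2)

print(bound <= obj + 1e-9, round(bound, 3), round(obj, 3))  # bound holds
```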
Slides 39-40: Best Affine Unbiased Estimator

For a linear regression model $y = A^T\beta + \epsilon$ with $\epsilon\sim\mathcal{N}(0,\sigma^2 V)$, and a given matrix $B$, an affine unbiased estimator of $B^T\beta$ is an estimator of the form $\hat\theta = X^Ty + c$ satisfying $\mathbb{E}\hat\theta = B^T\beta$. Best means $\operatorname{Var}(\hat\theta^*) \preceq \operatorname{Var}(\hat\theta)$ for every affine unbiased estimator $\hat\theta$.

If a solution to
$$\nu(A,B,V) := \min_{X:\,AX=B}\ \tfrac12\operatorname{tr}\big(X^TVX\big)$$
exists and is unique, then $\hat\theta^* = (X^*)^Ty$. Moreover,
$$\nu(A,B,V) = -\sigma_{\mathcal{D}(A,B)}(0,V),$$
and the optimal solution $X^*$ satisfies
$$M(V)\begin{pmatrix}X^*\\ W\end{pmatrix} = \begin{pmatrix}0\\ B\end{pmatrix}$$
for some multiplier $W$.
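A numerical sketch (my own; $V$ positive definite and $A$ full row rank so both forms below are valid): solve the block system for $X^*$ and compare with the classical generalized least squares expression.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, q = 6, 3, 2              # X is n x q, A is p x n, B is p x q
A = rng.standard_normal((p, n))
B = rng.standard_normal((p, q))
C = rng.standard_normal((n, n))
V = C @ C.T + np.eye(n)        # positive definite noise covariance

# Solve M(V) [X; W] = [0; B] for the BLUE weights X*.
M = np.block([[V, A.T], [A, np.zeros((p, p))]])
RHS = np.vstack([np.zeros((n, q)), B])
Xstar = np.linalg.solve(M, RHS)[:n]

# Compare with the generalized least squares form
# X* = V^{-1} A^T (A V^{-1} A^T)^{-1} B.
Vinv_At = np.linalg.solve(V, A.T)
X_gls = Vinv_At @ np.linalg.solve(A @ Vinv_At, B)
print(np.allclose(Xstar, X_gls))  # True
```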
Slides 41-42: Supervised Representation Learning

Consider a binary classification problem where we are given training data $(x_1,y_1),\dots,(x_n,y_n)\in\mathbb{R}^m\times\{-1,1\}$ and test data $x_{n+1},\dots,x_{n+t}\in\mathbb{R}^m$. Representation learning aims to learn a feature mapping $\Phi:\mathbb{R}^m\to\mathcal{H}$ that maps the data points to a feature space in which the two classes are well separated.

Kernel methods: instead of specifying the map $\Phi$ explicitly, kernel methods map the data points into a reproducing kernel Hilbert space $\mathcal{H}$, so that the kernel matrix $K\in\mathcal{S}^{n+t}_+$, with $K_{ij} = \langle\Phi(x_i),\Phi(x_j)\rangle$, implicitly determines the mapping $\Phi$.

Let $\mathcal{K}\subset\mathcal{S}^{n+t}_+$ be a set of candidate kernels. The best $K\in\mathcal{K}$ can be selected by maximizing its alignment with the kernel specified by the training labels:
$$\varphi_{\mathcal{V}}^*(y) = \max_{K\in\mathcal{K}}\ \tfrac12\big\langle K_{1:n,1:n},\ yy^T\big\rangle,$$
where $\mathcal{V} = \mathcal{K}\subseteq\mathcal{S}^{n+t}_+$, $A = \begin{pmatrix}0_{n\times n} & 0\\ 0 & I_{t\times t}\end{pmatrix}$, and $B = 0_{(n+t)\times 1}$.
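A toy sketch of this selection rule (entirely my own: the data, the Gaussian candidate kernels, and the widths are made-up assumptions) that picks the candidate maximizing $\tfrac12\langle K_{1:n,1:n}, yy^T\rangle$ from a finite set:

```python
import numpy as np

rng = np.random.default_rng(6)
n, t = 20, 5
Xtr = rng.standard_normal((n, 2))
y = np.sign(Xtr[:, 0])                     # labels from the first coordinate
Xall = np.vstack([Xtr, rng.standard_normal((t, 2))])

def rbf_kernel(Z, width):
    """Gaussian kernel matrix on all n + t points."""
    d2 = ((Z[:, None, :] - Z[None]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width**2))

# Finite candidate set; score each kernel by its alignment with yy^T
# restricted to the training block.
candidates = {w: rbf_kernel(Xall, w) for w in (0.1, 0.5, 1.0, 2.0)}
scores = {w: 0.5 * (y @ K[:n, :n] @ y) for w, K in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```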
Slide 43: Multi-task Learning

In multi-task learning, $T$ sets of labelled training data $(x_{t1},y_{t1}),\dots,(x_{tn},y_{tn})\in\mathbb{R}^m\times\mathbb{R}$ are given, representing $T$ learning tasks.

Assumption: a linear feature map $h_i(x) = \langle u_i, x\rangle$, $i = 1,\dots,m$, where $U = (u_1,\dots,u_m)$ is an $m\times m$ orthogonal matrix, and the predictor for each task is $f_t(x) := \langle a_t, h(x)\rangle$. The multi-task learning problem is then
$$\min_{A,U}\ \sum_{t=1}^{T}\sum_{i=1}^{n} L_t\big(y_{ti},\ \langle a_t, U^Tx_{ti}\rangle\big) + \mu\|A\|_{2,1}^2,$$
where $A = (a_1,\dots,a_T)$, $\|A\|_{2,1}^2$ is the square of the sum of the 2-norms of the rows of $A$, and $L_t$ is a loss function for each task. Setting $W = UA$, this nonconvex problem is equivalent to the following convex problem [Argyriou-Evgeniou-Pontil 2006]:
$$\min_{W,D}\ \sum_{t=1}^{T}\sum_{i=1}^{n} L_t\big(y_{ti},\ \langle w_t, x_{ti}\rangle\big) + 2\mu\,\gamma(W,D) \quad\text{s.t.}\quad \operatorname{tr} D \le 1,$$
which in turn is equivalent to
$$\min_{W}\ \sum_{t=1}^{T}\sum_{i=1}^{n} L_t\big(y_{ti},\ \langle w_t, x_{ti}\rangle\big) + \mu\|W\|_*^2.$$
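The equivalence of the last two problems can be checked numerically (my own sketch, using the $p = 1$ corollary above): for fixed $W$, the inner minimum of $\gamma(W,D)$ over $\{\operatorname{tr} D\le 1,\ D\succeq 0\}$ is attained at $D = (WW^T)^{1/2}/\|W\|_*$, where $2\mu\,\gamma(W,D)$ evaluates to $\mu\|W\|_*^2$.

```python
import numpy as np

rng = np.random.default_rng(7)
W = rng.standard_normal((4, 3))

U, s, Vt = np.linalg.svd(W, full_matrices=False)
nuclear = s.sum()

# Closed-form minimizer of gamma(W, D) over {tr D <= 1, D >= 0}.
D = (U * s) @ U.T / nuclear                 # (W W^T)^{1/2} / ||W||_*
gamma_WD = 0.5 * np.trace(W.T @ np.linalg.pinv(D) @ W)

print(np.isclose(2 * gamma_WD, nuclear**2))  # 2*gamma(W,D) = ||W||_*^2
```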
Slide 44: Thank you!
Slide 45: References

J. V. Burke and T. Hoheisel. Matrix Support Functionals for Inverse Problems, Regularization, and Learning. SIAM Journal on Optimization, 25(2), 2015.